THIS IS A DRAFT.

About this handbook

This is a free open-access R reference manual for applied epidemiologists and public health practitioners.

This book strives to:

  • Serve as a quick reference manual - not as a textbook or comprehensive R training
  • Address common epidemiological problems via task-centered examples
  • Be accessible in settings with low internet connectivity via an offline version (instructions below)

What gaps does this book address?

  • Many epidemiologists are transitioning to R from SAS, STATA, SPSS, Excel, or other software
  • A single repository of best-practice code for common epi tasks can save hours of online searching
  • Epidemiologists sometimes work in low internet-connectivity environments and have limited support

How is this different than other R books?

  • It is written by epidemiologists, for epidemiologists - leveraging experience in local, national, academic, and emergency settings
  • It provides examples of epidemic curves, transmission chains, epidemic modeling and projections, age and sex pyramids and standardization, record matching, outbreak detection, survey analysis, causal diagrams, survival analysis, GIS basics, phylogenetic trees, automated reports, etc…

How to read this handbook

Online version

  • Search via the search box above the Table of Contents
  • Click the “copy” icons to copy code
  • See the “Resources” section of each page for further resources

To download the offline version follow these steps:

  1. Click on the “offline_long.html” file in our GitHub repository
  2. Click the “Download” button. A new browser window will open with HTML source code
  3. Right-click (Windows) on the webpage or press Cmd+S (Mac) to “Save As” the webpage - ensure the file type is “Webpage, Complete”
  4. The file will download. It is large (>40MB), so when opened the content may take time to appear.
  5. It displays as one long page - search with Ctrl+f (Cmd-f)

Edit or contribute

We welcome your feedback and comments. To contribute or modify content directly, please post an issue or submit a pull request at this GitHub repository.

Acknowledgements

Contributors

This book is produced by a collaboration of epidemiologists from around the world, drawing upon experiences with organizations including local/state/provincial/national health departments and ministries, the World Health Organization (WHO), MSF (Médecins Sans Frontières / Doctors Without Borders), hospital systems, and academic institutions.

Editor-in-Chief: Neale Batra

Core team: Neale Batra, Alex Spina, Amrish Baidjoe, Pat Keating, Henry Laurenson-Schafer, Finlay Campbell

Authors: Neale Batra, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Isaac Florence, Natalie Fischer, Daniel Molling, Liza Coyer, Jonny Polonski, Yurie Izawa, Sara Hollis, Isha Berry

Reviewers:

Advisers:

Funding and programmatic support

The handbook project received funding via a COVID-19 emergency capacity-building grant from the Training Programs in Epidemiology and Public Health Interventions Network (TEPHINET).

Programmatic support was provided by the EPIET Alumni Network (EAN) and MSF’s Manson Unit.

Inspiration

The multitude of tutorials and vignettes that provided foundational knowledge for development of handbook content are credited within their respective pages.

More generally, the following sources provided inspiration and laid the groundwork for this handbook:
  • The “R4Epis” project (a collaboration between MSF and RECON)
  • R Epidemics Consortium (RECON)
  • R for Data Science book (R4DS)
  • bookdown: Authoring Books and Technical Documents with R Markdown
  • Netlify, which hosts this website

Image credits

Logo (US CDC Public Health Image Library):
  • 2013 Yemen looking for mosquito breeding sites
  • Ebola virus
  • Survey in Rajasthan

License and Terms of Use

This handbook is not an approved product of any specific organization.

Although we strive for accuracy, we make no guarantee of the accuracy of the content in this book.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

I About this book

Style and editorial notes

On this page we describe the philosophical approach, style, and specific editorial decisions made during the creation of this handbook.

Approach and style

The potential audience for this book is large. It will surely be used by people very new to R, and also by experienced R users looking for best practices and tips. So it must be both accessible and succinct. Therefore, our approach was to provide just enough text explanation that someone very new to R can apply the code and follow what the code is doing.

A few other points:

  • This is a code reference book accompanied by relatively brief examples - not a thorough textbook on R or data science
  • This is an R handbook for use within applied epidemiology - not a manual on the methods or science of applied epidemiology

Packages

So many choices

One of the most challenging aspects of learning R is knowing which R package to use for a given task. It is common to struggle through a task, only to later realize - hey, there’s an R package that does all that in one command!

In this handbook, we try to offer you at least two ways to complete each task: one tried-and-true method (probably in base R or the tidyverse) and one special R package that is custom-built for that purpose. We want you to have a couple of options in case you can’t download a given package or it otherwise does not work for you.

In choosing which packages to use, we prioritized R packages and approaches that have been tested and vetted by the community, are friendly to beginners, are stable (not changing very often), and accomplish the task with minimal adaptation.

This handbook generally prioritizes R packages and functions from the tidyverse. Tidyverse is a collection of R packages designed for data science that share underlying grammar and data structures. All tidyverse packages can be installed or loaded via the tidyverse package. Read more at the tidyverse website.

When applicable, we also offer code options using base R - the packages and functions that come with R at installation. This is because we recognize that some in this book’s audience may not have reliable internet to download extra packages.
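As a small illustration of this dual approach, here is one task - keeping rows where age is over 25 - done with base R bracket indexing, with the tidyverse (dplyr) equivalent shown as a comment. The data frame and column names here are hypothetical, for demonstration only:

```r
# A small hypothetical data frame, for illustration only
df <- data.frame(
  name = c("Ana", "Ben", "Carlos"),
  age  = c(30, 22, 41)
)

# Base R: subset rows using bracket indexing
older_base <- df[df$age > 25, ]

# tidyverse equivalent (requires the dplyr package to be installed):
# older_tidy <- dplyr::filter(df, age > 25)
```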

Linking functions to packages explicitly

It is often frustrating in R tutorials when a function is shown in code but you don’t know which package it comes from! We try to avoid this situation.

In the narrative text, package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). We strive to be explicit about which package a function comes from, either by referencing the package in nearby text or by specifying the package explicitly in the code like this: dplyr::mutate(). It may look redundant, but we are doing it on purpose.
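As a quick sketch of this convention, the call below uses stats::median() - the median() function from the stats package (installed with base R) - written with its package name so the source is unambiguous:

```r
values <- c(2, 4, 6, 8)

# Explicit package::function() call; this works even if another
# loaded package also defines a function named median()
mid_value <- stats::median(values)
```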

See the page on R basics to learn more about packages and functions.

Code style

In the handbook, we frequently utilize “new lines”, making our code appear “long”. We do this for a few reasons:

  • We can write explanatory comments with # that are adjacent to each little part of the code
  • Generally speaking, longer (vertical) code is easier to read
  • This way is easier to read on a narrow screen (no sideways scrolling needed)
  • From the indentation, it should be easier to tell which arguments belong to which function

As a result, code that could be written like this:

linelist %>% 
  group_by(name) %>%                    # group the rows by 'name'
  slice_max(date, n = 1, with_ties = F) # if there's a tie (of date), take the first row

…is written like this:

linelist %>% 
  group_by(name) %>%   # group the rows by 'name'
  slice_max(
    date,              # keep row per group with maximum date value 
    n = 1,             # keep only the single highest row 
    with_ties = F)     # if there's a tie (of date), take the first row

R code is generally not affected by new lines or indentation. In fact, if you start a new line after a comma while writing code, RStudio will apply automatic indentation for you.

We also use lots of spaces (e.g. n = 1 instead of n=1) because it is easier to read. Be nice to the people reading your code!

Notes

Here are the types of notes you may encounter in the handbook:

NOTE: This is a note.
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.

Editorial decisions

Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool for consideration, please join/start a conversation on our Github page.

Table of package, function, and other editorial decisions

Subject | Considered | Outcome & date | Brief rationale
Epiweeks | aweek, lubridate | lubridate, Dec 2020 | Consistency, package maintenance prospects
ggplot labels | labs(), ggtitle(), ylab(), xlab() | labs(), Feb 2021 | All labels in one function/one place

Datasets used

Data used in this handbook are either simulated or publicly available. All the data can be downloaded from the “data” folder of our Github repository. Below are more details about some of the data:

  • The case linelist
    • A simulated Ebola outbreak, expanded by the handbook authors from the one in the outbreaks package
  • Aggregated counts
    • A simulated dataset of malaria counts by age, day, and facility in a fictional region, which can be downloaded from the data folder noted above
  • Time series and outbreak detection
    • Campylobacter cases reported in Germany 2002-2011. Available from the surveillance package
    • Climate data (temperature in degrees Celsius and rainfall in millimetres) in Germany 2002-2011. Downloaded from the EU Copernicus satellite reanalysis dataset using the ecmwfr package.
  • GIS page shapefiles
    • Downloaded from the Humanitarian Data Exchange (HDX) - see link in the page
  • Phylogenetic tree data
    • Newick file of phylogenetic tree constructed from whole genome sequencing of 299 Shigella sonnei samples and corresponding sample data.
    • The Belgian samples and resulting data are kindly provided by the Belgian NRC for Salmonella and Shigella in the scope of a project conducted by an ECDC EUPHEM Fellow, and will also be published in a manuscript.
    • The international data are openly available on public databases (ncbi) and have been previously published.

II Basics

R Basics

Overview

This page is not intended to be a comprehensive “learn R” tutorial. However, it does cover some fundamentals that can be useful for reference or for refreshing your memory. See the section on recommended training for links to more comprehensive tutorials.

See the page on Transition to R for tips on switching to R from STATA and SAS.

Why use R?

As stated on the R project website, R is a programming language and environment for statistical computing and graphics. It is highly versatile, extensible, and community-driven.

Cost

R is free to use! There is a strong ethic in the community of free and open-source material.

Reproducibility

Conducting your data management and analysis through a programming language (compared to Excel or another primarily point-click/manual tool) enhances reproducibility, makes error-detection easier, and eases your workload.

Community

The R community of users is enormous and collaborative. New packages and tools to address real-life problems are developed daily, and vetted by the community of users. As one example, R-Ladies is a worldwide organization whose mission is to promote gender diversity in the R community, and is one of the largest organizations of R users. It likely has a chapter near you!

Installation

How to install R

Visit this website https://www.r-project.org/ and download the latest version of R suitable for your computer.

How to install R Studio

Visit this website https://rstudio.com/products/rstudio/download/ and download the latest free Desktop version of RStudio suitable for your computer.

How to update R and RStudio

Your version of R is printed to the R Console at start-up. You can also run sessionInfo().
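For example, you can print just the version string, or the full session details (the exact output depends on your installation):

```r
# The R version as a single character string
R.version.string

# Full session information: R version, operating system,
# and attached packages
sessionInfo()
```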

To update R, go to the website mentioned above and re-install R. Alternatively, you can use the installr package (on Windows) by running installr::updateR(). This will open dialog boxes to help you download the latest R version and update your packages to the new R version. More details can be found in the installr documentation.

Be aware that the old R version will still exist on your computer. You can temporarily run an older version (older “installation”) of R by clicking “Tools” -> “Global Options” in RStudio and choosing an R version. This can be useful if you want to use a package that has not been updated to work with the newest version of R.

To update RStudio, you can go to the website above and re-download RStudio. Another option is to click “Help” -> “Check for Updates” within RStudio, but this may not show the very latest updates.

Other software you may need to install

  • TinyTeX (for compiling an RMarkdown document to PDF)
  • Pandoc (for compiling RMarkdown documents)
  • RTools (for building packages for R)
  • phantomjs (for saving still images of animated networks, such as transmission chains)

TinyTeX

TinyTeX is a custom LaTeX distribution, useful when trying to produce PDFs from R.
See https://yihui.org/tinytex/ for more information.

To install TinyTex from R:

install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()

Pandoc

Pandoc is a document converter - a piece of software separate from R. It comes bundled with RStudio and should not need to be downloaded separately. It helps convert R Markdown documents to formats like .pdf and adds complex functionality.

RTools

RTools is a collection of software for building packages for R.

Install from this website: https://cran.r-project.org/bin/windows/Rtools/

phantomjs

phantomjs is often used to take “screenshots” of webpages. For example, when you make a transmission chain with the epicontacts package, an interactive, dynamic HTML file is produced. If you want a static image, it can be useful to use the webshot package to automate this process. This requires the external program “phantomjs”, which you can install via the webshot package with the command webshot::install_phantomjs().

RStudio

RStudio Orientation

First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.

For RStudio to function you must also have R installed on the computer (see this section for installation instructions).

RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!

By default RStudio displays four rectangle panes.

TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.

The R Console Pane

The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where commands are actually run and where non-graphic outputs and error/warning messages appear. You can enter and run commands directly in the R Console, but be aware that these commands are not saved, as they are when run from a script.

If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.

The Source Pane
This pane, by default in the upper-left, is space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.

For Stata users, this pane is similar to your Do-file and Data Editor windows.

The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.

In Stata, this is most similar to the Variables Manager window.

This pane also contains a History tab, where you can see commands that you have run previously. It also has a “Tutorial” tab where you can complete interactive R tutorials if you have the learnr package installed.

Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).

This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.

RStudio settings

Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options. There you can change the default settings, including appearance/background color.

Restart

If R freezes, you can restart it by going to the Session menu and clicking “Restart R”. This avoids the hassle of closing and re-opening RStudio. Note that everything in your R environment will be removed when you do this.

Scripts

Scripts are a fundamental part of programming. Storing your code in a script (vs. typing in the console) has many advantages:

  • Reproducibility - so that others can know exactly what you did (and what you might have done wrong!)
  • Version control - so you can track changes made by yourself or colleagues
  • Commenting/annotation - to explain to your colleagues what you have done

Below is an example of a short R script. Remember: the more succinctly you explain your code in comments, the more your colleagues will like you!

R markdown

Rmarkdown is a type of script in which the script itself becomes a document (PDF, Word, HTML, Powerpoint, etc.). See the handbook page on R Markdown documents.

R notebooks

There is no difference between writing in an R Markdown script vs. an R notebook. However, the execution of the document differs slightly. See this site for more details.

Shiny

Shiny apps/websites are contained within one script, which must be named app.R. This file has three components:

  1. A user interface (ui)
  2. A server function
  3. A call to the shinyApp function

See the handbook page on Shiny and dashboards, or this online tutorial: Shiny tutorial

Previously, this file was split into two files (ui.R and server.R).
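To make the three components concrete, below is a minimal, hypothetical app.R sketch (the input names and displayed text are illustrative, not from the handbook):

```r
library(shiny)

# 1. The user interface
ui <- fluidPage(
  numericInput("cases", "Number of cases:", value = 10),
  textOutput("message")
)

# 2. The server function
server <- function(input, output) {
  output$message <- renderText(
    paste("You reported", input$cases, "cases")
  )
}

# 3. The call to the shinyApp function
shinyApp(ui = ui, server = server)
```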

Working directory

The working directory is the root folder location used by R for your work - where R looks for and saves files by default. By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.

The working directory appears in grey text at the top of the RStudio Console pane. You can also return the current working directory with getwd() (leave the parentheses empty).

See the page on R projects for details on our recommended approach to managing your working directory. A common, efficient, and trouble-free way to use R is to combine these three elements: an R project to store all your files, the here package to locate files, and the rio package to import/export files.
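A hedged sketch of this pattern - the “data” folder and the file name “linelist.csv” are hypothetical. The here/rio calls are shown as comments (they require those packages to be installed); the base-R file.path() line demonstrates the same path-building idea:

```r
# With an R project open, build a path relative to the project root
# and import the file (requires the here and rio packages):
# linelist <- rio::import(here::here("data", "linelist.csv"))

# The same path-building idea with base R, relative to the
# current working directory:
path <- file.path("data", "linelist.csv")
```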

Set by command

Although we do not recommend this approach in most circumstances, you can use the command setwd() with the desired folder file path in quotations, for example:

setwd("C:/Documents/R Files/My analysis")

Set manually

To set the working directory manually (point-and-click), click the Session drop-down menu and go to “Set Working Directory” and then “Choose Directory”. This will set the working directory for that specific R session (if using this approach, you will have to do this each time you open RStudio).

Within an R project

If using an R project, the working directory defaults to the R project root folder - the folder containing the “.Rproj” file. This applies if you open RStudio by opening the R project file (the file with the “.Rproj” extension).

Working directory in an R markdown

In an R Markdown script, the default working directory is the folder in which the R Markdown file (.Rmd) is saved. If using an R project and the here package, this does not apply and the working directory will be here(), as explained in the R projects page.

If you want to change the working directory of a stand-alone R Markdown (not in an R project), using setwd() will only apply to that specific code chunk. To make the change apply to all code chunks, edit the setup chunk to add the root.dir = parameter, as below:

knitr::opts_knit$set(root.dir = 'desired/directorypath')

It is much easier to just use the R markdown within an R project and use the here package.

Providing file paths

Perhaps the most common source of frustration for an R beginner (at least on a Windows machine) is typing in a filepath to import data. Note the following:

Slash direction - If typing in a filepath, beware the direction of the slashes. Enter them using forward slashes to separate the components (“data/provincial.csv”). For Windows users, the default way that filepaths are displayed and copied is with backslashes (“\”) - so this means you will need to change the direction of each slash.
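For example, these two strings both point to the same (hypothetical) file on a Windows machine - the first uses forward slashes, the second escapes each backslash by doubling it:

```r
p1 <- "C:/data/provincial.csv"    # forward slashes (recommended)
p2 <- "C:\\data\\provincial.csv"  # doubled backslashes also work

# A single backslash does NOT work:
# "C:\data\provincial.csv" causes an "unrecognized escape" error
```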

If you use the here package as described in R projects the slash direction is not an issue.

Broken paths -

Below is an example of an “absolute” or “full address” filepath. These will likely break if used by another computer.

C:/Users/Name/Document/Analytic Software/R/Projects/Analysis2019/data/March2019.csv  

In most situations, we recommend using “relative” filepaths instead - that is, the path relative to the root of an R project. You can do this using the here package as explained in the R projects page.

One possible exception to this is if you need to load data from folder locations outside of an R project. In this case you can still use an R project and relative file paths for your scripts and outputs, but you may need to use an absolute file path to import these data.

Objects

Everything in R is an object. These sections will explain:

  • How to create objects (<-)
  • Types of objects (e.g. data frames, vectors..)
  • How to access subparts of objects (e.g. variables in a dataset)
  • Classes of objects (e.g. numeric, logical, integer, double, character, factor)

Everything is an object

Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object, which is assigned a name and can be referenced in later commands.

An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.

Defining objects (<-)

Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:

object_name <- value (or a process/calculation that produces a value)

EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object current_week is created when it is assigned the character value "2018-W10" (the quote marks make this a character value).
The object current_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.

See the R commands and their output in the boxes below.

current_week <- "2018-W10"   # this command creates the object current_week by assigning it a value
current_week                 # this command prints the current value of current_week object in the console
## [1] "2018-W10"

NOTE: Note the [1] in the R console output is simply indicating that you are viewing the first item of the output

CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.

The following command will re-define the value of current_week:

current_week <- "2018-W51"   # assigns a NEW value to the object current_week
current_week                 # prints the current value of current_week in the console
## [1] "2018-W51"

Dataset

Datasets are also objects (typically “dataframes”) and must be assigned names when they are imported. In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package and its import() function.

# linelist is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")

You can read more about importing and exporting datasets with the section on Import and export.

CAUTION: A quick note on naming of objects:

  • Object names must not contain spaces; use an underscore (_) or a period (.) instead of a space.
  • Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
  • Object names must begin with a letter (cannot begin with a number like 1, 2 or 3).
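The naming rules above, demonstrated (the object names here are arbitrary examples):

```r
# Valid object names
case_count <- 14   # underscore instead of a space
case.count <- 14   # a period also works
CaseCount  <- 14   # case-sensitive: distinct from case_count

# Invalid names - these would cause errors if uncommented:
# case count <- 14   # contains a space
# 1st_count  <- 14   # begins with a number
```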

Object structure

Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.

The graphic below, sourced from this online R tutorial, shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.

In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:

Common structure | Explanation | Example
Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the column age_years).
Data Frames | Vectors (e.g. columns) bound together, all having the same number of rows. | linelist is a data frame.

Note that to create a vector that “stands alone” (is not part of a data frame), the function c() is used to combine the different elements. For example, to create a vector of colors for a plot’s color scale: list_of_colors <- c("blue", "red2", "orange", "grey")

Object classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class | Explanation | Examples
Character | Text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks”
Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000
Numeric | Numbers, which can include decimals. If within quotation marks they will be considered character. | 23.1 or 14
Factor | Vectors that have a specified order or hierarchy of values | Variable msf_involvement with ordered values N, S, SUB, and U
Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980
Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE
data.frame | How R stores a typical dataset: vectors (columns) of data bound together, all with the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each.
tibble | A variation on the data frame; the main operational difference is that tibbles print more nicely to the console (displaying the first 10 rows and only the columns that fit on screen). | Any data frame, list, or matrix can be converted to a tibble with as_tibble()
list | Like a vector, but holds objects that can be of different classes | A list could hold a single number, a data frame, a vector, and even another list within it!

You can test the class of an object by providing its name to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.

class(linelist)         # class should be a data frame
## [1] "data.frame"
class(linelist$age)     # class should be numeric
## [1] "numeric"
class(linelist$gender)  # class should be character
## [1] "character"

Sometimes, a column will be converted to a different class automatically by R. Watch out for this! For example, if you have a vector or column of numbers, but a character value is inserted… the entire column will change to class character.

One common example of this is when manipulating a data frame in order to print a table - if you make a total row and try paste/glue together percents in the same cell as numbers (e.g. 23 (40%)), the entire numeric column above will convert to character and can no longer be used for mathematical calculations.

num_vector <- c(1,2,3,4,5) # define vector as all numbers
class(num_vector)          # vector is numeric class
## [1] "numeric"
num_vector[3] <- "three"   # convert the third element to a character
class(num_vector)          # vector is now character class
## [1] "character"

Sometimes, you will need to convert objects or columns to another class.

Function | Action
as.character() | Converts to character class
as.numeric() | Converts to numeric class
as.integer() | Converts to integer class
as.Date() | Converts to Date class - Note: see the section on dates for details
as.factor() | Converts to factor - Note: re-defining the order of value levels requires extra arguments

Likewise, there are base R functions to check whether an object IS of a specific class, such as is.numeric(), is.character(), is.double(), is.factor(), is.integer()
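A brief sketch of converting and checking classes (the values are arbitrary):

```r
x <- "23.1"            # a character value - math on x would fail
is.character(x)        # TRUE

num <- as.numeric(x)   # convert to numeric class
is.numeric(num)        # TRUE
num + 1                # math now works
```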

Here is more online material on classes and data structures in R.

Columns/Variables ($)

A column in a data frame is technically a “vector” (see table above) - a series of values that must all be the same class (either character, numeric, logical, etc).

A vector can exist independent of a data frame, for example a vector of column names that you want to include as explanatory variables in a model. To create a “stand alone” vector, use the c() function as below:

# define the stand-alone vector of character values
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

# print the values in this named vector
explanatory_vars
## [1] "gender" "fever"  "chills" "cough"  "aches"  "vomit"

Columns in a data frame are also vectors and can be called, referenced, extracted, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. In this handbook, we try to use the word “column” instead of “variable”.

# Retrieve the length of the vector 'age'
length(linelist$age) # (age is a column in the linelist data frame)

By typing the name of the data frame followed by $, you will also see a drop-down menu of all columns in the data frame. You can scroll through them using your arrow keys, select one with the Enter key, and avoid spelling mistakes!

ADVANCED TIP: Some more complex objects (e.g. a list, or an epicontacts object) may have multiple levels which can be accessed through multiple dollar signs. For example epicontacts$linelist$date_onset

Access/index with brackets ([ ])

You may need to view parts of objects, also called “indexing”, which is often done using the square brackets [ ]. Using $ on a dataframe to access a column is also a type of indexing.

my_vector <- c("a", "b", "c", "d", "e", "f")  # define the vector
my_vector[5]                                  # print the 5th element
## [1] "e"

Square brackets also work to return specific parts of a returned output, such as the output of the summary() function:

# All of the summary
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     6.0    13.0    16.2    23.0    90.0      87
# Just one part of the summary
summary(linelist$age)[2]
## 1st Qu. 
##       6
# One part without its "name"
summary(linelist$age)[[2]]
## [1] 6

Brackets also work on data frames to view specific rows and columns. You can do this using the syntax dataframe[rows, columns]:

# View a specific row (2) from dataset, with all columns (don't forget the comma!)
linelist[2,]

# View all rows, but just one column
linelist[, "date_onset"]

# View values from row 2 and columns 5 through 10
linelist[2, 5:10] 

# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)] 

# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]

# View rows and columns based on criteria
# *** Note the data frame must still be named in the criteria!
linelist[linelist$age > 25 , c("date_onset", "date_birth", "age")]

# Use View() to see the outputs in the RStudio Viewer pane (easier to read) 
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])

# Save as a new object
new_table <- linelist[2:20, c("date_onset")] 

When indexing an object of class list, single brackets always return with class list, even if only a single object is returned. Double brackets, however, can be used to access a single element and return a different class than list.
Brackets can also be written after one another, as demonstrated below.

This visual explanation of list indexing, with pepper shakers, is humorous and helpful.

# define demo list
my_list <- list(
  # First element in the list is a character vector
  hospitals = c("Central", "Empire", "Santa Anna"),
  
  # second element in the list is a data frame of addresses
  addresses   = data.frame(
    street = c("145 Medical Way", "1048 Brown Ave", "999 El Camino"),
    city   = c("Andover", "Hamilton", "El Paso")
    )
  )

Here is how the list looks when printed to the console. See how there are two named elements:

  • hospitals, a character vector
  • addresses, a data frame of addresses
my_list
## $hospitals
## [1] "Central"    "Empire"     "Santa Anna"
## 
## $addresses
##            street     city
## 1 145 Medical Way  Andover
## 2  1048 Brown Ave Hamilton
## 3   999 El Camino  El Paso

Now we extract, using various methods:

my_list[1] # this returns the element in class "list" - the element name is still displayed
## $hospitals
## [1] "Central"    "Empire"     "Santa Anna"
my_list[[1]] # this returns only the (unnamed) character vector
## [1] "Central"    "Empire"     "Santa Anna"
my_list[["hospitals"]] # you can also index by name of the list element
## [1] "Central"    "Empire"     "Santa Anna"
my_list[[1]][3] # this returns the third element of the "hospitals" character vector
## [1] "Santa Anna"
my_list[[2]][1] # This returns the first column ("street") of the address data frame
##            street
## 1 145 Medical Way
## 2  1048 Brown Ave
## 3   999 El Camino

Remove objects

You can remove individual objects from your R environment by putting the name in the rm() function (no quote marks):

rm(object_name)

You can remove all objects (clear your workspace) by running:

rm(list = ls(all = TRUE))

Functions

This section on functions explains:

  • What a function is and how they work
  • What arguments are
  • How to get help understanding a function

Simple functions

A function is like a machine that receives inputs, does some action with those inputs, and produces an output. What the output is depends on the function.

Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:

sqrt(49)
## [1] 7

The object provided to a function can also be a column in a dataset. For example, when the function summary() is applied to the numeric column age in the dataset linelist, the output is a summary of the column's numeric and missing values.

summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     6.0    13.0    16.2    23.0    90.0      87

NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.

Functions with multiple arguments

Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.

  • Some arguments are required for the function to work correctly, others are optional
  • Optional arguments have default settings
  • Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs

Here is a fun fictional function, called oven_bake(), as an example of a typical function. It takes an input object (e.g. a dataset, or in this example “dough”) and performs operations on it as specified by additional arguments (minutes = and temperature =). The output can be printed to the console, or saved as an object using the assignment operator <-.
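To make the analogy concrete, here is what such a command might look like (the oven_bake() function, the dough object, and both arguments are fictional and exist only for illustration):

```r
# A fictional function - oven_bake() does not exist in any package
baked_item <- oven_bake(
  dough,             # the input object
  minutes = 30,      # a numeric argument
  temperature = 350  # another numeric argument
  )
```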

In a more realistic example, the age_pyramid() command below produces an age pyramid plot based on defined age groups and a binary split column, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the dataframe to use, age_cat5 as the column to count, and gender as the binary column to use for splitting the pyramid by color.

# Create an age pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_cat5", split_by = "gender")

The above command can be equivalently written as below, with newlines. This can be easier to read and to write # comments. To run this command you can highlight the entire command, or just place your cursor in the first line and then press Ctrl and Enter keys simultaneously.

# Create an age pyramid
apyramid::age_pyramid(
  data = linelist,        # case linelist
  age_group = "age_cat5", # age group column
  split_by = "gender"     # two sides to pyramid
  )

The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.

# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_cat5", "gender")

A more complex age_pyramid() command might include the optional arguments to:

  • Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
  • Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)

NOTE: For arguments that you specify with both parts of the argument (e.g. proportional = TRUE), their order among all the arguments does not matter.

apyramid::age_pyramid(
  linelist,                    # use case linelist
  "age_cat5",                  # age group column
  "gender",                    # split by gender
  proportional = TRUE,         # percents instead of counts
  pal = c("orange", "purple")  # colors
  )

Packages

Packages contain functions.

An R package is a shareable bundle of code and documentation that contains pre-defined functions. Users in the R community develop and share packages all the time, so chances are that a solution already exists for your problem! You will install and use hundreds of packages in your use of R.

On installation, R contains “base” packages and functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use. In this handbook, package names are written in bold. One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.

Install and load

Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command (from base R) at the beginning of each R session. Later in this section we advocate for use of p_load() instead of library(), but library() is still a very common option.

Think of R as your personal library: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow that book from your library.

Your library

Your “library” is actually a folder on your computer, containing a folder for each package that has been installed. Find out where R is installed in your computer, and look for a folder called “win-library”. For example: R\win-library\4.0 (the 4.0 is the R version - you’ll have a different library for each R version you’ve downloaded). As a last-resort measure, you can remove a package by manually deleting it from here (but it is better to use remove.packages("packagename")).

CRAN

CRAN (Comprehensive R Archive Network) is a public warehouse of R packages that have been published by R community members. Most often, R users download packages from CRAN.

Install vs. Load

To use a package, two steps must be completed:

  1. The package must be installed (once), and
  2. The package must be loaded (each R session)

The basic function for installing a package is install.packages(), where the name of the package is provided in quotes. This can also be accomplished point-and-click by going to the RStudio “Packages” pane and clicking “Install” and typing the package name. Note all this is case-sensitive.

install.packages("tidyverse")

The basic function to load a package for use (after it has been installed) is library(), with the name of the package NOT in quotes.

library(tidyverse)

To check whether a package is installed or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with its version number. If the box is checked, it is loaded for the current session.
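You can also check programmatically with base R ("tidyverse" below is just an example package name):

```r
# Is the package installed? (i.e. present in your library folder)
"tidyverse" %in% rownames(installed.packages())

# Is the package loaded (attached) in the current session?
"package:tidyverse" %in% search()
```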

Using pacman

This handbook emphasizes use of the package pacman (abbreviation for “package manager”), which offers the useful function p_load(). This function combines the above two steps into one - it installs and/or loads packages, depending on what is needed. If the package has not yet been installed, it will attempt to install from CRAN, and then load it.

Below, we load three of the packages often used in this R basics page:

pacman::p_load(tidyverse, rio, here)

Install from github

Sometimes, you need to install the development version of a package, from a Github repository. You can use p_load_gh() from pacman (this function is a “wrapper” around install_github() from the devtools package).

In the examples below, the first name listed in the quotation marks is the Github ID of the repository owner, and after the slash is the name of the repository. If you want to install from a branch other than the main/master branch, add it after an “@”.

Of course, you have to load pacman before running any of its functions, or just specify the package name with two colons pacman:: - this loads the package in order to execute that function.

# install/load a package from the main branch of a Github repository
pacman::p_load_gh("reconhub/epicontacts")

# install the development version from the "timeline" branch, not the main branch
pacman::p_install_gh("reconhub/epicontacts@timeline")

Read more about pacman in this online vignette

Install from ZIP or TAR

You could install the package from a URL:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

Or, download it to your computer in a zipped file:

Option 1: using install_local() from the remotes package

remotes::install_local("~/Downloads/dplyr-master.zip")

Option 2: using install.packages() from base R, providing the file path to the ZIP file and setting type = "source" and repos = NULL.

install.packages("~/Downloads/dplyr-master.zip", repos=NULL, type="source")

Code syntax

For clarity in this handbook, functions are sometimes preceded by the name of their package using the :: symbol in the following way: package_name::function_name()

Once a package is loaded for a session, this explicit style is not necessary and one can just use function_name(). However, writing the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()). Writing the package name also lets you use a function from an installed package even if it has not been loaded with library().

# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")

Function help

To read more about a function, you can search for it in the Help tab of the lower-right RStudio pane. You can also run a command like ?thefunctionname (put the name of the function after a question mark) and the Help page will appear in the Help pane. Finally, try searching online for resources.

Update packages

You can update packages by re-installing them. You can also click the green “Update” button in your RStudio Packages pane to see which packages have new versions to install. Be aware that your old code may need to be updated if there is a major revision to how a function works!

Delete packages

Use p_delete() from pacman, or remove.packages() from base R. Alternatively, go find the folder which contains your library and manually delete the folder.

Dependencies

Packages often depend on other packages to work. These are called dependencies. If a dependency fails to install, then the package depending on it may also fail to install.

See the dependencies of a package with p_depends(), and see which installed packages depend on it with p_depends_reverse().
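For example (dplyr is used here just as an example package):

```r
# packages that dplyr depends on
pacman::p_depends(dplyr)

# locally installed packages that depend on dplyr
pacman::p_depends_reverse(dplyr)
```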

Masked functions

It is not uncommon for two or more packages to contain the same function name. For example, the package dplyr has a filter() function, but so does the package stats. Which filter() is used by default depends on the order in which these packages were loaded in the R session - the one loaded later will be the default for the command filter().

You can check the order in your Environment pane of R Studio - click the drop-down for “Global Environment” and see the order of the packages. Functions from packages lower on that drop-down list will mask functions of the same name in packages that appear higher in the drop-down list. When first loading a package, R will warn you in the console if masking is occurring, but this can be easy to miss.

Here are ways you can fix masking:

  1. Specify the package name in the command. For example, use dplyr::filter()
  2. Re-arrange the order in which the packages are loaded (e.g. within p_load()), and start a new R session
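For example, the first fix applied to filter() looks like this (using the handbook's running linelist dataset for illustration):

```r
# Prefix with the package name so it does not matter which package was loaded last
dplyr::filter(linelist, age > 25)   # dplyr's row-filtering function
stats::filter(1:10, rep(1/3, 3))    # stats' time-series filter of the same name
```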

Detach / unload

To detach (unload) a package, use this command, with the correct package name and only one colon. Note that this may not resolve masking.

detach(package:PACKAGE_NAME_HERE, unload=TRUE)

Install older version

See this guide to install an older version of a particular package.

Suggested packages

See the page on Suggested packages for a listing of packages we recommend for everyday epidemiology.

Piping (%>%)

Two general approaches to working with objects are:

  1. Pipes/tidyverse - pipes send an object from function to function - emphasis is on the action, not the object
  2. Define intermediate objects - an object is re-defined again and again - emphasis is on the object

Pipes

Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.

  • Piping emphasizes a sequence of actions, not the object the actions are being performed on
  • Pipes are best when a sequence of actions must be performed on one object
  • Pipes come from the package magrittr, which is automatically loaded with the dplyr and tidyverse packages
  • Pipes can make code more clean and easier to read, more intuitive

Read more on this approach in the tidyverse style guide

Here is a fake example for comparison, using fictional functions to “bake a cake”. First, the pipe method:

# A fake example of how to bake a cake using piping syntax

cake <- flour %>%       # to define cake, start with flour, and then...
  left_join(eggs) %>%   # add eggs
  left_join(oil) %>%    # add oil
  left_join(water) %>%  # add water
  mix_together(         # mix together
    utensil = spoon,
    minutes = 2) %>%    
  bake(degrees = 350,   # bake
       system = "fahrenheit",
       minutes = 35) %>%  
  let_cool()            # let it cool down

Here is another link describing the utility of pipes.

Piping is not a base function. To use piping, the magrittr package must be installed and loaded (this is typically done by loading tidyverse or dplyr package). You can read more about piping in the magrittr documentation.

Note that just like other R commands, pipes can be used to just display the result, or to save/re-save an object, depending on whether the assignment operator <- is involved. See both below:

# Create or overwrite object, defining as aggregate counts by age category (not printed)
linelist_summary <- linelist %>% 
  count(age_cat)
# Print the table of counts in the console, but don't save it
linelist %>% 
  count(age_cat)

CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.

%<>%
This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, so object %<>% function() %>% function() is the same as object <- object %>% function() %>% function().

Define intermediate objects

This approach to changing objects/dataframes may be better if:

  • You need to manipulate multiple objects
  • There are intermediate steps that are meaningful and deserve separate object names

Risks:

  • Creating new objects for each step means creating lots of objects. If you use the wrong one you might not realize it!
  • Naming all the objects can be confusing
  • Errors may not be easily detectable

Either name each intermediate object, overwrite the original, or combine all the functions together. Each approach comes with its own risks.

Below is the same fake “cake” example as above, but using this style:

# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)

batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)

cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)

cake <- let_cool(cake)

Combine all functions together - this is difficult to read:

# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))

Key operators and functions

This section details operators in R, such as:

  • Definitional operators
  • Relational operators (less than, equal to, …)
  • Logical operators (and, or…)
  • Handling missing values
  • Mathematical operators and functions (+/-, >, sum(), median(), …)
  • The %in% operator

Assignment operators

<-

The basic assignment operator in R is <-. Such that object_name <- value.
This assignment operator can also be written as =. We advise use of <- for general R use.
We also advise surrounding such operators with spaces, for readability.

<<-

If writing functions, or using R interactively with sourced scripts, you may need the assignment operator <<- (from base R). This operator is used to define an object in a higher ‘parent’ R environment. See this online reference.
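A minimal sketch of the difference (the counter object and add_one() function are invented for illustration):

```r
counter <- 0                  # defined in the global environment

add_one <- function() {
  counter <<- counter + 1     # <<- re-defines 'counter' in the parent environment
}

add_one()
add_one()
counter                       # now 2, because <<- modified the global object
```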

%<>%

This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, as shown below in two equivalent examples:

linelist <- linelist %>% 
  mutate(age_months = age_years * 12)

The above is equivalent to the below:

linelist %<>% mutate(age_months = age_years * 12)

%<+%

This is used to add data to phylogenetic trees with the ggtree package. See the page on Phylogenetic trees or this online resource book.

Relational and logical operators

Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:

Function                   Operator   Example      Example Result
Equal to                   ==         "A" == "a"   FALSE (because R is case sensitive)
Not equal to               !=         2 != 0       TRUE
Greater than               >          4 > 2        TRUE
Less than                  <          4 < 2        FALSE
Greater than or equal to   >=         6 >= 4      TRUE
Less than or equal to      <=         6 <= 4      FALSE
Value is missing           is.na()    is.na(7)     FALSE (see page on Missing data)
Value is not missing       !is.na()   !is.na(7)    TRUE

Note that == (double equals) is different from = (single equals), which acts like the assignment operator <-.

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

Function Operator
AND &
OR | (vertical bar)
Parentheses ( ) Used to group criteria together and clarify order of operations

For example, below we have a linelist with two variables we want to use to create our case definition: rdt_result, a test result, and other_cases_in_home, which tells us whether there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:

linelist_cleaned <- linelist %>%
  mutate(case_def = case_when(
    is.na(rdt_result) & is.na(other_cases_in_home)           ~ NA_character_,
    rdt_result == "Positive"                                 ~ "Confirmed",
    rdt_result != "Positive" & other_cases_in_home == "Yes"  ~ "Probable",
    TRUE                                                     ~ "Suspected"
  ))
Criteria in example above                                                            Resulting value in new variable “case_def”
Values for rdt_result and other_cases_in_home are both missing                       NA (missing)
Value in rdt_result is “Positive”                                                    “Confirmed”
Value in rdt_result is NOT “Positive” AND value in other_cases_in_home is “Yes”      “Probable”
None of the above criteria are met                                                   “Suspected”

Note that R is case-sensitive, so “Positive” is different than “positive”…

Missing values

In R, missing values are represented by the special value NA (a “reserved” value) (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA. How to do this is addressed in the Import and export page.

To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.

rdt_result <- c("Positive", "Suspected", "Positive", NA)   # two positive cases, one suspected, and one unknown
is.na(rdt_result)  # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE  TRUE

Read more about missing, infinite, NULL, and impossible values in the page on Missing data. Learn how to convert missing values when importing data in the page on Import and export.

Mathematics and statistics

All the operators and functions in this page are automatically available through base R.

Mathematical operators

These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.

Objective Example in R
addition 2 + 3
subtraction 2 - 3
multiplication 2 * 3
division 30 / 5
exponent 2^3
order of operations ( )

Mathematical functions

Objective Function
rounding (banker's rounding) round(x, digits = n)
rounding (halves always up) janitor::round_half_up(x, digits = n)
ceiling (round up) ceiling(x)
floor (round down) floor(x)
absolute value abs(x)
square root sqrt(x)
exponential exp(x)
natural logarithm log(x)
log base 10 log10(x)
log base 2 log2(x)

Scientific notation

To turn off scientific notation in your R session, run this command:

options(scipen=999)

Rounding

DANGER: round() uses “banker’s rounding”, which rounds a .5 to the nearest even number. Use round_half_up() from janitor to consistently round halves up to the nearest whole number. See this explanation

# use the appropriate rounding function for your work
round(c(2.5, 3.5))
## [1] 2 4
janitor::round_half_up(c(2.5, 3.5))
## [1] 3 4

Statistical functions:

CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm=TRUE is specified

Objective Function
mean (average) mean(x, na.rm=T)
median median(x, na.rm=T)
standard deviation sd(x, na.rm=T)
quantiles* quantile(x, probs)
sum sum(x, na.rm=T)
minimum value min(x, na.rm=T)
maximum value max(x, na.rm=T)
range of numeric values range(x, na.rm=T)
summary** summary(x)

Notes:

  • quantile(): x is the numeric vector to examine, and probs = is a numeric vector with probabilities within 0 and 1.0, e.g c(0.5, 0.8, 0.85)
  • summary(): gives a summary on a numeric vector including mean, median, and common percentiles
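For example, to return the median and the 80th and 85th percentiles of the age column (using the handbook's running linelist dataset):

```r
# median, 80th, and 85th percentiles, ignoring missing values
quantile(linelist$age, probs = c(0.5, 0.8, 0.85), na.rm = TRUE)
```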

DANGER: If providing a vector of numbers to one of the above functions, be sure to wrap the numbers within c() .

# If supplying raw numbers to a function, wrap them in c()
mean(1, 6, 12, 10, 5, 0)    # !!! INCORRECT !!!  
## [1] 1
mean(c(1, 6, 12, 10, 5, 0)) # CORRECT
## [1] 5.666667

Other useful functions:

Objective Function Example
create a sequence seq(from, to, by) seq(1, 10, 2)
repeat x, n times rep(x, ntimes) rep(1:3, 2) or rep(c("a", "b", "c"), 3)
subdivide a numeric vector cut(x, n) cut(linelist$age, 5)
take a random sample sample(x, size) sample(linelist$id, size = 5, replace = TRUE)

%in%

A very useful operator for matching values, and for quickly assessing if a value is within a vector or dataframe.

my_vector <- c("a", "b", "c", "d")
"a" %in% my_vector
## [1] TRUE
"h" %in% my_vector
## [1] FALSE

To ask if a value is not %in% a vector, put an exclamation mark (!) in front of the logic statement:

# to negate, put an exclamation in front
!"a" %in% my_vector
## [1] FALSE
!"h" %in% my_vector
## [1] TRUE

%in% is very useful when using the dplyr function case_when(). You can define a vector previously, and then reference it later. For example:

affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")

linelist <- linelist %>% 
  mutate(child_hospitalized = case_when(
    hospitalized %in% affirmative & age < 18 ~ "Hospitalized Child",
    TRUE                                     ~ "Not"))

Note: If you want to detect a partial string, perhaps using str_detect() from stringr, it will not accept a character vector like c("1", "Yes", "yes", "y"). Instead, it must be given a regular expression - one condensed string with OR bars, such as “1|Yes|yes|y”. For example, str_detect(hospitalized, "1|Yes|yes|y"). See the page on Characters and strings for more information.

You can convert a character vector to a single regular-expression string with the commands below:

affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")
affirmative
## [1] "1"   "Yes" "YES" "yes" "y"   "Y"   "oui" "Oui" "Si"
# condense to 
affirmative_str_search <- paste0(affirmative, collapse = "|")  # option with base R
affirmative_str_search <- str_c(affirmative, collapse = "|")   # option with stringr package

affirmative_str_search
## [1] "1|Yes|YES|yes|y|Y|oui|Oui|Si"

Errors & warnings

This section explains:

  • The difference between errors and warnings
  • General syntax tips for writing R code
  • Code assists

Common errors and warnings and troubleshooting tips can be found in the page on Errors and warnings.

Error versus Warning

When a command is run, the R Console may show you warning or error messages in red text.

  • A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.

  • An error means that R was not able to complete your command.

Look for clues:

  • The error/warning message will often include a line number for the problem.

  • If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.

If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!

General syntax tips

A few things to remember when writing commands in R, to avoid errors and warnings:

  • Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk
  • Avoid spaces in column and object names. Use underscore ( _ ) or periods ( . ) instead
  • Keep track of and remember to separate a function’s arguments with commas
  • R is case-sensitive, meaning Variable_A is different from variable_A

Code assists

Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the right side of the script, to warn you.

Transition to R

Below, we provide some advice and resources if you are transitioning to R from another statistical language.

From Stata

Coming to R from Stata

Many epidemiologists are first taught how to use Stata, and it can seem daunting to move into R. However, if you are a comfortable Stata user, the jump into R is certainly more manageable than you might think. While there are some key differences between Stata and R in how data are created and modified, as well as in how analysis functions are implemented, after learning these key differences you will be able to translate your skills.

Below are some key translations between Stata and R, which may be handy as you review this guide.

General notes

Stata: You can only view and manipulate one dataset at a time
R: You can view and manipulate multiple datasets at the same time, so you will frequently have to specify your dataset within the code

Stata: Online community available through https://www.statalist.org/
R: Online community available through RStudio, StackOverflow, and R-bloggers

Stata: Point-and-click functionality as an option
R: Minimal point-and-click functionality

Stata: Help for commands available by help [command]
R: Help available by ?[function] or search in the Help pane

Stata: Comment code using * or /// or /* TEXT */
R: Comment code using #

Stata: Almost all commands are built in to Stata. New/user-written functions can be installed as ado files using ssc install [package]
R: R installs with base functions, but typical use involves installing other packages from CRAN (see page on R basics)

Stata: Analysis is usually written in a do file
R: Analysis is written in an R script in the RStudio source pane. R markdown scripts are an alternative.

Working directory

Stata: Working directories involve absolute filepaths (e.g. “C:/username/documents/projects/data/”)
R: Working directories can be either absolute, or relative to a project root folder by using the here package (see Import and export)

Stata: See the current working directory with pwd
R: Use getwd() or here() (if using the here package), with empty parentheses

Stata: Set the working directory with cd “folder location”
R: Use setwd("folder location"), or set_here("folder location") (if using the here package)

Importing and viewing data

STATA R
Specific commands per file type Use import() from rio package for almost all filetypes. Specific functions exist as alternatives (see Import and export)
Reading in csv files is done by import delimited “filename.csv” Use import("filename.csv")
Reading in xslx files is done by import excel “filename.xlsx” Use import("filename.xlsx")
Browse your data in a new window using the command browse View a dataset in the RStudio source pane using View(dataset). You need to specify your dataset name to the function in R because multiple datasets can be held at the same time. Note capital “V” in this function
Get a high-level overview of your dataset using summarize, which provides the variable names and basic information Get a high-level overview of your dataset using summary(dataset)
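The import-and-view steps above can be sketched in R as follows (the filename is hypothetical):

```r
pacman::p_load(rio)                 # install/load the rio package

linelist <- import("linelist.csv")  # import the file into a data frame object

View(linelist)                      # browse the data in a viewer tab (capital "V")
summary(linelist)                   # high-level overview of each column
```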

Basic data manipulation

STATA R
Dataset columns are often referred to as “variables” More often referred to as “columns” or sometimes as “vectors” or “variables”
No need to specify the dataset In each of the below commands, you need to specify the dataset - see the page on Cleaning data and core functions for examples
New variables are created using the command generate varname = Generate new variables using the function mutate(varname = ). See page on Cleaning data and core functions for details on all the below dplyr functions.
Variables are renamed using rename old_name new_name Columns can be renamed using the function rename(new_name = old_name)
Variables are dropped using drop varname Columns can be removed using the function select() with the column name in the parentheses following a minus sign
Factor variables can be labeled using a series of commands such as label define Labeling values can be done by converting the column to Factor class and specifying levels. See page on Factors. Column names are not typically labeled as they are in Stata.
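To illustrate the last row above, below is a sketch of the R analogue of Stata's label define and label values, using a hypothetical 0/1-coded column:

```r
# a hypothetical data frame with a 0/1-coded column
linelist <- data.frame(vaccinated = c(0, 1, 1, 0))

# analogue of Stata's 'label define' + 'label values':
# convert the column to class Factor, specifying levels and labels
linelist$vaccinated <- factor(linelist$vaccinated,
                              levels = c(0, 1),
                              labels = c("No", "Yes"))

table(linelist$vaccinated)  # tabulates using the new labels
```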

Descriptive analysis

STATA R
Tabulate counts of a variable using tab varname Provide the dataset and column name to table() such as table(dataset$colname). Alternatively, use count(dataset, colname) from the dplyr package, as explained in Grouping data
Cross-tabulation of two variables in a 2x2 table is done with tab varname1 varname2 Use table(dataset$varname1, dataset$varname2) or count(dataset, varname1, varname2)
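A sketch of these tabulations, using a small hypothetical dataset:

```r
# a small hypothetical dataset
dataset <- data.frame(
  outcome = c("Death", "Recover", "Recover", "Death"),
  gender  = c("f", "m", "f", "f"))

# one-way tabulation (Stata: tab outcome)
table(dataset$outcome)

# cross-tabulation (Stata: tab outcome gender)
table(dataset$outcome, dataset$gender)

# dplyr alternative - note the dataset is the first argument
dplyr::count(dataset, outcome, gender)
```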

While this list gives an overview of the basics in translating Stata commands into R, it is not exhaustive. There are many other great online resources for Stata users transitioning to R that could be of interest.

From SAS

Under construction

Data interoperability

See the Import and export page for details on how the R package rio can import and export files such as STATA .dta files, SAS .xpt and .sas7bdat files, SPSS .por and .sav files, and many others.

Import and export

In this page we describe ways to locate, import, and export files:

  • Using the rio package to import() and export() data
  • The here package to locate files in your computer or R project
  • Specific import scenarios, such as
    • Excel sheets
    • Google sheets
    • Websites
    • Skipping rows
  • Exporting/saving data files

Overview

When you import a “dataset” into R, you are generally creating a new data frame object in your R environment and defining it as the imported flat file (Excel, CSV, etc.). To learn more about objects and the assignment operator, see the page on R Basics.

The rio package

The R package we recommend is: rio. The import() function from rio utilizes the file extension (e.g. .xlsx, .csv, .rds, .tsv) to correctly import or export the file. The name “rio” is an abbreviation of “R I/O” (input/output).

The alternative to using rio is to use functions from many other packages, each of which is specific to a type of file. For example, read.csv() (base R), read.xlsx() (openxlsx package), and write_csv() (readr package), etc. These alternatives can be difficult to remember, whereas using import() and export() from rio is easy.

rio’s functions import() and export() use the appropriate package and function for a given file, based on its file extension. See the end of this page for a complete table of which packages/functions rio uses in the background. It can also be used to import STATA, SAS, and SPSS files, among dozens of others.

Import/export of shapefiles requires other packages, as detailed in the page on GIS basics.

File paths

When importing or exporting data, you must provide a file path. You can do this one of three ways:

  1. Provide the “full” / “absolute” file path
  2. Provide a “relative” file path - the location relative to an R project root directory
  3. Manual file selection

“Absolute” file paths

This is an example of an absolute file path, placed within quotes and provided to import(), to be saved in R as the data frame object named linelist.

linelist <- import("C:/Users/Pierre/My Documents/epiproject/data/linelists/ebola_linelist.xlsx")

A few things to note about absolute file paths:

  • Avoid using absolute file paths as they will usually break if the script is run on a different computer
  • They can be used if you must load data from a shared drive folder distant from where your R script is saved
  • Use forward slashes as in the example above (this is NOT the default for Windows file paths)
  • File paths that begin with double slashes (e.g. “//…”) will likely not be recognized by R and will produce an error. Consider moving to a “named” or “lettered” drive that begins with a letter (e.g. “J:” or “C:”). See the page on Directory interactions for more details on this issue.

“Relative” file paths

Below, the same file location is given relative to an R project root folder “epiproject”. This assumes you are working in an R project and have loaded the package here.

linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))

The package here and its function here() locate files on your computer in relation to the root directory of an R project, and provide a shortcut syntax for providing a file path to a function like import(). This is how here() works:

  • Ensure you are working within an R project (read more on the R projects page)
  • When the here package is first loaded within the R project, it places a small file called “here” in the root-level folder of your R project as a “benchmark” or “anchor” for all other files in the project.
  • In your script, if you want to reference a file saved in your project’s folders, you use the function here() to tell R where the file is located in relation to that anchor.
  • here() can be used for both importing and exporting.

If you are unsure where “here” root is set to, run the function here() with empty parentheses:

# load the package
pacman::p_load(here)

# return the folder path that "here" is set to 
here()
## [1] "C:/Users/Neale/OneDrive - Neale Batra/Documents/Analytic Software/R/Projects/R handbook/Epi_R_handbook"

You can build onto that anchor by specifying further folders, within quotes, separated by commas, finally ending with the filename and extension. This approach also removes slash direction as a source of error.

Running the here() command with folder names and a file name returns the extended file path, which can then be processed by the import() function.

# the filepath
here("data", "linelist.xlsx")
## [1] "C:/Users/Neale/OneDrive - Neale Batra/Documents/Analytic Software/R/Projects/R handbook/Epi_R_handbook/data/linelist.xlsx"

Select file manually

You can import data manually via one of these methods:

  1. Environment RStudio Pane, click “Import Dataset”, and select the type of data
  2. Click File / Import Dataset / (select the type of data)
  3. To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# Manual selection of a file. When this command is run, a POP-UP window will appear. 
# The file path selected will be supplied to the import() command.

my_data <- import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

Excel sheets

If you want to import a specific sheet from an Excel workbook, provide the sheet name to the which = argument. For example:

my_data <- import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parentheses of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelist.xlsx"), which = "Sheet1")

To export a dataframe from R to a specific Excel sheet and have the rest of the Excel workbook remain unchanged, you will have to import, edit, and export with an alternative package catered to this purpose such as openxlsx. See more information in the page on Directory interactions or at this github page.

If your Excel workbook is .xlsb (binary format Excel workbook) you may not be able to import it using rio. Consider re-saving it as .xlsx, or using a package like readxlsb which is built for this purpose.

Missing values

You may want to designate which value(s) in your dataset should be considered as missing. As explained in the page on Missing data, the value in R for missing data is NA, but perhaps the dataset you want to import uses 99, “Missing”, or just empty character space "" instead.

Use the na = argument for import() and provide the value(s) within quotes (even if they are numbers). You can specify multiple values by including them within a vector, using c() as shown below.

linelist <- import(here("data", "linelist_raw.xlsx"), na = "99")
linelist <- import(here("data", "cleaning_dict.csv"), na = c("Missing", "", " "))

Google sheets

You can import data from an online Google spreadsheet with the googlesheets4 package, after authenticating your access to the spreadsheet.

pacman::p_load("googlesheets4")

Below, a demo Google sheet is imported and saved. This command may prompt you to confirm authentication of your Google account. Follow the prompts and pop-ups in your internet browser to grant Tidyverse API packages permission to edit, create, and delete your spreadsheets in Google Drive.

The sheet below is “viewable for anyone with the link” and you can try to import it.

Gsheets_demo <- read_sheet("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")

The sheet can also be imported using only the sheet ID, a shorter part of the URL:

Gsheets_demo <- read_sheet("1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY")

Another package, googledrive, offers useful functions for writing, editing, and deleting Google sheets. It works alongside googlesheets4 functions such as gs4_create() and sheet_write().

Here are some other helpful online tutorials: a basic importing tutorial, a more detailed tutorial, and one on the interaction between the two packages.

Scraping websites

Scraping data from a website - TBD - Under construction

Skip rows

Sometimes, you may want to avoid importing a row of data. You can do this with the argument skip = if using import() from rio on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row

Unfortunately skip = only accepts one integer value, not a range (e.g. “2:10” does not work). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr. See the example below of skipping only row 2.
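Below is a sketch of this approach for skipping only row 2 of a hypothetical .csv file, assuming the file imports cleanly with these arguments:

```r
pacman::p_load(rio, dplyr)

# import the header and the first data row only
first_row  <- import("linelist_raw.csv", nrows = 1)

# import rows 3 onward, re-using the true column names
later_rows <- import("linelist_raw.csv",
                     skip = 2,
                     col.names = names(first_row))

# bind the pieces together - row 2 is excluded
linelist_raw <- bind_rows(first_row, later_rows)
```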

Remove second header row

Sometimes, your data may have a second row that you want to remove, for example if it is a “data dictionary” row as shown below. This situation can be problematic because it can result in all columns being imported as class “character”.

To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact argument used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). This is because rio uses a different function for the different file types (see the table at the end of this page).

For Excel files: (col_names =)

# import first time; store the column names
linelist_raw_names <- import("linelist_raw.xlsx") %>% names()  # save true column names

# import second time; skip row 2, and assign column names to argument col_names =
linelist_raw <- import("linelist_raw.xlsx",
                       skip = 2,
                       col_names = linelist_raw_names
                       ) 

For CSV files: (col.names =)

# import first time; store column names
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names

# note argument for csv files is 'col.names = '
linelist_raw <- import("linelist_raw.csv",
                       skip = 2,
                       col.names = linelist_raw_names
                       ) 

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names


Make a data dictionary

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it. This tip is adapted from this post.

dict <- linelist_2headers %>%             # begin: linelist with dictionary as first row
  head(1) %>%                             # keep only column names and first dictionary row                
  pivot_longer(cols = everything(),       # pivot all columns to long format
               names_to = "Column",       # assign new column names
               values_to = "Description")

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row. Replace “df” with the name of your dataset.

names(df) <- paste(names(df), df[1, ], sep = "_")

Manual data entry

Entry by rows

Use the tribble() function from the tibble package from the tidyverse (online tibble reference).

Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.). You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. Spaces do not matter between values, but each row is represented by a new line of code. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
  )

And now we display the new dataset:

Entry by columns

Since a data frame consists of vectors (vertical columns), the base approach to manual dataframe creation in R expects you to define each column and then bind them together. This can be counter-intuitive in epidemiology, as we usually think about our data in rows (as above).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

CAUTION: All vectors must be the same length (same number of values).

The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

And now we display the new dataset:

Pasting from clipboard

If you copy data from elsewhere and have it on your clipboard, you can try the following function from base R to convert those data into an R data frame:

df_from_clipboard <- read.table(
  file = "clipboard",   # specify this as "clipboard"
  sep = "\t",           # separator could be tab ("\t"), comma (","), etc.
  header = TRUE)        # if there is a header row

Export

With rio, you can use the export() function in a very similar way to import(). First give the name of the R object you want to save (e.g. linelist) and then, in quotes, the file path including the file name and extension. For example:

export(linelist, "my_linelist.xlsx") # will save to working directory

You could save the same data frame as a .csv file, to a folder specified with a here() relative pathway:

export(linelist, here("data", "clean", "my_linelist.csv"))

RDS files

Along with .csv, .xlsx, etc, you can also export/save R data frames as .rds files. This is a file format specific to R, and is very useful if you know you will work with the exported data again in R.

The classes of columns are stored, so you don't have to do the cleaning again when it is imported (with an Excel or even a CSV file this can be a headache!).

For example, if you work in an Epi team and need to send files to a GIS team for mapping, and they use R as well, just send them the .rds file! Then all the column classes are retained and they have less work to do.

export(linelist, here("data", "clean", "my_linelist.rds"))

Rdata files

.Rdata files store R objects, and can actually store multiple R objects within one file, for example multiple dataframes, model results, lists, etc. This can be very useful to consolidate or share a lot of your data for a given project.

In the below example, multiple R objects are stored within the exported file “my_objects.Rdata”:

rio::export(list(my_list = my_list, my_dataframe = my_dataframe, my_vector = my_vector), "my_objects.Rdata")

Note: if you are trying to import a list, use import_list() from rio to import it with the complete original structure and contents.

rio::import_list("my_list.Rdata")

Saving plots

How to save plots, such as those created by ggplot() is discussed in depth in the ggplot tips page.

In brief, run ggsave("my_plot_filepath_and_name.png") after printing your plot. You can either provide a saved plot object to the plot = argument, or only specify the destination file path (with file extension) to save the most recently-displayed plot. You can also control the width =, height =, units =, and dpi =.
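For example, a sketch of saving a plot with specified dimensions (the plot object and file names are hypothetical):

```r
pacman::p_load(ggplot2)

ggsave("outputs/my_plot.png",  # destination path; the extension sets the format
       plot   = my_plot,       # omit to save the most recently-displayed plot
       width  = 8,
       height = 5,
       units  = "in",          # "in", "cm", or "mm"
       dpi    = 300)           # resolution, relevant for raster formats
```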

How to save a network graph, such as a transmission tree, is addressed in the page on Transmission chains.

Resources

The R Data Import/Export Manual
R 4 Data Science chapter
ggsave

Below is a table, taken from the rio online vignette. For each type of data it shows: the expected file extension, the package rio uses to import or export the data, and whether this functionality is included in the default installed version of rio.

Format Typical Extension Import Package Export Package Installed by Default
Comma-separated data .csv data.table fread() data.table Yes
Pipe-separated data .psv data.table fread() data.table Yes
Tab-separated data .tsv data.table fread() data.table Yes
SAS .sas7bdat haven haven Yes
SPSS .sav haven haven Yes
Stata .dta haven haven Yes
SAS XPORT .xpt haven haven Yes
SPSS Portable .por haven Yes
Excel .xls readxl Yes
Excel .xlsx readxl openxlsx Yes
R syntax .R base base Yes
Saved R objects .RData, .rda base base Yes
Serialized R objects .rds base base Yes
Epiinfo .rec foreign Yes
Minitab .mtp foreign Yes
Systat .syd foreign Yes
“XBASE” database files .dbf foreign foreign
Weka Attribute-Relation File Format .arff foreign foreign Yes
Data Interchange Format .dif utils Yes
Fortran data no recognized extension utils Yes
Fixed-width format data .fwf utils utils Yes
gzip comma-separated data .csv.gz utils utils Yes
CSVY (CSV + YAML metadata header) .csvy csvy csvy No
EViews .wf1 hexView No
Feather R/Python interchange format .feather feather feather No
Fast Storage .fst fst fst No
JSON .json jsonlite jsonlite No
Matlab .mat rmatio rmatio No
OpenDocument Spreadsheet .ods readODS readODS No
HTML Tables .html xml2 xml2 No
Shallow XML documents .xml xml2 xml2 No
YAML .yml yaml yaml No
Clipboard default is tsv clipr clipr No

R projects

An R project enables your work to be bundled in a self-contained folder. Within the project, all the relevant scripts, data files, figures/outputs, and history are stored in sub-folders and, importantly, the working directory is the project's root folder.

Optimal use

A common, efficient, and trouble-free way to use R is to combine these 3 elements. Each is described in the sections below.

  1. An R project
    • A self-contained working environment with folders for data, scripts, outputs, etc.
  2. The here package for relative filepaths
    • Filepaths are written relative to the root folder of the R project - see Import and export for more information
  3. The rio package for importing/exporting
    • import() and export() handle any file type by its extension (e.g. .csv, .xlsx, .png)

Creating an R project

To create an R project, select “New Project” from the File menu.

  • If you want to create a new folder for the project, select “New directory” and indicate where you want it to be created.
  • If you want to create the project within an existing folder, click “Existing directory” and indicate the folder.
  • If you want to clone a Github repository, select the third option “Version Control” and then “Git”. See the page on [Collaboration with Github] for further details.

The R project you create will come in the form of a folder containing a .Rproj file. This file is a shortcut and likely the primary way you will open your project. You can also open a project by selecting “Open Project” from the File menu. Alternatively on the far upper right side of RStudio you will see an R project icon and a drop-down menu of available R projects.

To exit from an R project, either open a new project, or close the project (File - Close Project).

Switch projects

To switch between projects, click the R project icon and drop-down menu at the very top-right of RStudio. You will see options to Close Project, Open Project, and a list of recent projects.

Settings

It is generally advised that you start RStudio each time with a “clean slate” - that is, with your workspace not preserved from your previous session. This will mean that your objects and results will not persist session-to-session (you must re-create them by running your scripts). This is good, because it will force you to write better scripts and avoid errors in the long run.

To set RStudio to have a “clean slate” each time at start-up:

  • Select “Project Options” from the Tools menu.
  • In the “General” tab, set RStudio to not restore .RData into workspace at startup, and to not save workspace to .RData on exit.

Organization

It is common to have subfolders in your project. Consider having folders such as “data”, “scripts”, “figures”, “presentations”.

Version control

Consider a version control system. It could be something as simple as having dates on the names of scripts (e.g. “transmission_analysis_2020-10-03.R”) and an “archive” folder. Consider also having commented header text at the top of each script with a description, tags, authors, and change log.
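As a sketch, such a commented header could look like this (all details hypothetical):

```r
###################################################
# Transmission analysis
# Purpose:  describe and visualize transmission chains
# Authors:  A. Epi, B. Stat
# Created:  2020-10-03
# Changes:  2020-10-10 - restricted to confirmed cases
###################################################
```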

A more complicated method would involve using Github or a similar platform for version control. See the page on [Collaboration with Github].

One tip is that you can search across an entire project or folder using the “Find in Files” tool (Edit menu). It can search and even replace strings across multiple files.

Examples

Below are some examples of import/export/saving using here():

Importing linelist.xlsx from the “data” folder in your R project

linelist <- import(here("data", "linelist.xlsx"))

Exporting the R object linelist as “my_linelist.rds” to the “clean” folder within the “data” folder in your R project.

export(linelist, here("data","clean", "my_linelist.rds")

Saving the most recently printed plot as “epicurve_2021-02-15.png” within the “epicurves” folder in “outputs” folder in your R project.

ggsave(here::here("outputs", "epicurves", "epicurve_2021-02-15.png"))

Resources

RStudio webpage on using R projects

Suggested packages

Below is a long list of suggested packages for common epidemiological work in R. You can copy this code and use # symbols to remove any packages you do not want.

  • Packages that are included when installing/loading another package are indicated by an indent and hash symbol. For example, note how ggplot2 is listed under tidyverse.
  • Install the pacman package first, before running the below. You can do this with install.packages("pacman").

Also, consider using the package conflicted to manage conflicts and masking of functions. Masking occurs when two packages have a function with the same name. See the R basics section on packages for details and ways to resolve this.

# List of common & useful epidemiology R packages  

pacman::p_load(
     
     # learning R
     learnr,   # interactive tutorials in RStudio
        
     # project and file management
     here,     # filepaths relative to root project folder
     rio,      # import/export of many types of data
     openxlsx, # import/export of Excel workbooks 
     
     # package install and management
     pacman,   # package install/load
     renv,     # managing versions of packages in collaborative groups
     remotes,  # install from github
     
     # General data management
     tidyverse,    # includes many packages for tidy data wrangling and presentation
          #dplyr,
          #tidyr,
          #ggplot2,
     linelist,     # cleaning linelists
     lubridate,    # working with dates
     naniar,       # assessing missing data
     
     # statistics  
     gtsummary,    # making descriptive and statistical tables
     
     
     # epidemic modeling
     epicontacts,  # Analysing transmission networks
     EpiNow2,      # Rt estimation
     EpiEstim,     # Rt estimation
     projections,  # Incidence projections
     incidence,    # Handling incidence data
     epitrix,      # Useful epi functions
     distcrete,    # Discrete delay distributions
     
     
     # plots - general
     #ggplot2,         # included in tidyverse
     cowplot,          # combining plots
     RColorBrewer,     # color scales
     
     # plots - specific types
     DiagrammeR,       # diagrams using DOT language
     incidence,        # epidemic curves
     
     # gis
     sf,               # to manage spatial data using a Simple Feature format
     tmap,             # to produce simple maps, works for both interactive and static maps
     OpenStreetMap,    # to add OSM basemap in ggplot map
     
     # routine reports  
     rmarkdown,        # produce PDFs, Word Documents, Powerpoints, and HTML files
     reportfactory,    # Auto-organization of Rmarkdown outputs
     
     # tables
     knitr,            # report generation, kable() for html tables
     DT,               # HTML tables
     gt,               # HTML tables
     
     # phylogenetics  
     ggtree,           # visualization and annotation of trees
     ape,              # analysis of phylogenetics and evolution
     
     # interactive
     plotly,           # interactive graphics
     shiny             # interactive web apps  
)

III Data Management

Cleaning data and core functions

This page demonstrates common steps necessary to clean a dataset, starting with importing raw data and demonstrating a “pipe chain” of cleaning steps. We use a simulated Ebola case linelist, which is referenced often in this handbook.

This page also explains the use of many core functions used in data management, including:

  • %>% - pipe to pass the dataset from one function to the next
  • mutate() - to create, transform, and re-define columns
  • select() - to select or re-name columns
  • rename() - to rename columns
  • across() - to transform multiple columns at one time
  • filter() - to keep certain rows
  • add_row() - to add rows manually
  • clean_names() - to standardize the syntax of column names
  • as.character(), as.numeric(), as.Date(), etc. - to convert the class of a column
  • recode() - to re-code values in a column
  • case_when() - to re-code values in a column using more complex logical criteria
  • replace_na(), na_if(), coalesce() - special functions for re-coding
  • clean_data() - to re-code/clean using a data dictionary
  • age_categories() and cut() - to create categorical groups from a numeric column
  • distinct() - to de-duplicate rows

If you want to see how these functions compare to Stata or SAS, see the page on Transition to R.

Cleaning pipeline

This page proceeds through typical cleaning steps, adding them sequentially to a cleaning pipe chain.

In epidemiological analysis and data processing, cleaning steps are often linked together and performed sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another.

Such chains utilize dplyr “verb” functions and the magrittr pipe operator %>%. This pipe begins with the “raw” data (“linelist_raw.xlsx”) and ends with a “clean” R data frame (linelist).

In a cleaning pipeline the order of the steps is important. Cleaning steps might include:

  • Importing of data
  • Column names cleaned or changed
  • De-duplication
  • Column creation and transformation (e.g. re-coding or cleaning values)
  • Rows filtered or added
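Putting the steps above together, a minimal sketch of such a pipeline might look like the below (the column names are hypothetical; this page builds up the full worked version step-by-step):

```r
pacman::p_load(rio, janitor, dplyr)

linelist <- import("linelist_raw.xlsx") %>%  # import raw data
  clean_names() %>%                          # standardize column names
  distinct() %>%                             # de-duplicate rows
  mutate(age = as.numeric(age)) %>%          # create/transform columns
  filter(!is.na(case_id))                    # keep only rows with a case ID
```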

Load packages

Below are the packages used in this page:

pacman::p_load(
  rio,        # importing data  
  here,       # relative file pathways  
  janitor,    # data cleaning and tables
  lubridate,  # working with dates
  epikit,     # age_categories() function
  tidyverse   # data manipulation and visualization
)

Import data

Import

Here we import the raw .xlsx dataset using the import() function from the package rio, and save it as the data frame linelist_raw. If your dataset is large and takes a long time to import, it can be useful to keep the import command separate from the pipe chain, with the “raw” data saved as a distinct object. This also allows easy comparison between the original and cleaned versions.

See the page on Import and export for more details and unusual situations, including:

  • Skipping the import of certain rows
  • Dealing with a second row that is a data dictionary
  • Importing from Google sheets

Below we import the raw .xlsx file. We assume it is located in the working directory and so no sub-folders are specified in the filepath.

linelist_raw <- import("linelist_raw.xlsx")

You can view the first 50 rows of the original “raw” dataset below:

Review

You can use the package skimr and its function skim() to get an overview of the entire dataframe (see page on Descriptive analysis for more info). Columns are summarised by class (character, numeric, POSIXct - a type of date class).

Table 1: Data summary
Name linelist_raw
Number of rows 6479
Number of columns 28
_______________________
Column type frequency:
character 17
numeric 8
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 5 1.00 6 6 0 5888 0
date onset 5 1.00 10 10 0 563 0
outcome 1456 0.78 5 7 0 2 0
gender 321 0.95 1 1 0 2 0
hospital 1472 0.77 5 36 0 13 0
infector 2274 0.65 6 6 0 2697 0
source 2274 0.65 5 7 0 2 0
age 100 0.98 1 2 0 79 0
age_unit 5 1.00 5 6 0 2 0
fever 253 0.96 2 3 0 2 0
chills 253 0.96 2 3 0 2 0
cough 253 0.96 2 3 0 2 0
aches 253 0.96 2 3 0 2 0
vomit 253 0.96 2 3 0 2 0
time_admission 811 0.87 5 5 0 1095 0
merged_header 0 1.00 1 1 0 1 0
…28 0 1.00 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
generation 5 1.00 16.58 5.71 0.00 13.00 16.00 20.00 37.00
lon 5 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21
lat 5 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49
row_num 0 1.00 3240.00 1870.47 1.00 1620.50 3240.00 4859.50 6479.00
wt_kg 5 1.00 53.17 18.57 -8.00 42.00 55.00 66.00 121.00
ht_cm 5 1.00 124.89 49.44 6.00 89.00 130.00 158.00 335.00
ct_blood 5 1.00 21.25 1.65 16.00 20.00 22.00 22.00 26.00
temp 140 0.98 38.58 0.98 35.20 38.20 38.80 39.20 40.90

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
infection date 2273 0.65 2012-04-18 2015-04-27 2014-10-03 523
hosp date 5 1.00 2012-04-25 2015-04-30 2014-10-15 560
date_of_outcome 1041 0.84 2012-04-30 2015-06-04 2014-10-25 556
skimr::skim_without_charts(linelist_raw)

Column names

Column names are used very often, so they must have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replace with underscores _ )
  • No unusual characters (&, #, <, >, …)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The column names of linelist_raw are printed below using names() from base R. We can see that initially:

  • Some names contain spaces (e.g. infection date)
  • Different naming patterns are used for dates (date onset vs. infection date)
  • There must have been a merged header across the last two columns in the .xlsx. We know this because the name of the two merged columns (“merged_header”) was applied to the first one, while the second column was assigned the placeholder name “…28”, as it was otherwise empty and is the 28th column.
names(linelist_raw)
##  [1] "case_id"         "generation"      "infection date"  "date onset"      "hosp date"       "date_of_outcome" "outcome"        
##  [8] "gender"          "hospital"        "lon"             "lat"             "infector"        "source"          "age"            
## [15] "age_unit"        "row_num"         "wt_kg"           "ht_cm"           "ct_blood"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "...28"

NOTE: To reference a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. Note that on your keyboard, the back-tick (`) is different from the single quotation mark (').

Automatic cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. the German “ö” becomes “o”, the Spanish “ñ” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can specify name replacements with the replace = argument (e.g. replace = c(onset = "date_of_onset"))
  • Here is an online vignette
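For example, the case = and replace = arguments described above might be used together as sketched below (the specific replacement shown follows the illustrative example in the bullet above; the resulting names depend on your data):

```r
# sketch: apply a custom name replacement, keeping the default "snake" case
linelist_raw %>% 
  janitor::clean_names(case    = "snake",
                       replace = c(onset = "date_of_onset")) %>% 
  names()
```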

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "case_id"         "generation"      "infection_date"  "date_onset"      "hosp_date"       "date_of_outcome" "outcome"        
##  [8] "gender"          "hospital"        "lon"             "lat"             "infector"        "source"          "age"            
## [15] "age_unit"        "row_num"         "wt_kg"           "ht_cm"           "ct_blood"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "x28"

NOTE: The last column name “…28” was changed to “x28”.

Manual name cleaning

Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”: the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the column names have been changed:

##  [1] "case_id"              "generation"           "date_infection"       "date_onset"           "date_hospitalisation"
##  [6] "date_outcome"         "outcome"              "gender"               "hospital"             "lon"                 
## [11] "lat"                  "infector"             "source"               "age"                  "age_unit"            
## [16] "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

Rename by column position

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn  = 1,
       newNameForSecondColumn = 2)

Rename via select()

You can also rename columns within the dplyr select() function, which is used to retain only certain columns (and is covered later in this page). This approach also uses the format new_name = old_name. Here is an example:

linelist_raw %>% 
  select(# NEW name             # OLD name
         date_infection       = `infection date`,    # rename and KEEP ONLY these columns
         date_hospitalisation = `hosp date`)

Other challenges

Empty Excel column names

If you are importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see example above), or their name (linelist_raw$...1).

Merged Excel column names and cells

Merged cells in an Excel file are a common occurrence when receiving data from operational teams. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accommodate merged cells.

Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)

DANGER: If column names are merged with readWorkbook(), you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.

Select or re-order columns

Use select() from dplyr to select the columns you want to retain, and specify their order in the data frame.

CAUTION: In the examples below, linelist is modified with select() but not over-written. The modified column names are only displayed via names() for the purpose of demonstration.

Here are ALL the column names in the linelist at this point in the cleaning pipe chain:

names(linelist)
##  [1] "case_id"              "generation"           "date_infection"       "date_onset"           "date_hospitalisation"
##  [6] "date_outcome"         "outcome"              "gender"               "hospital"             "lon"                 
## [11] "lat"                  "infector"             "source"               "age"                  "age_unit"            
## [16] "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

Keep columns

Select only the columns you want to keep.

Put their names in the select() command, with no quotation marks. They will appear in the data frame in the order you provide. Note that if you include a column that does not exist, R will return an error (see the use of any_of() below to avoid an error in this situation).

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names()  # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation" "fever"

Helper functions

Helper functions and operators exist to make it easy to specify columns to keep or discard.

For example, if you want to re-order the columns, everything() is useful to signify “all other columns not yet mentioned”. The command below pulls columns date_onset and date_hospitalisation to the beginning, but keeps all the others afterward:

# move date_onset and date_hospitalisation to beginning
linelist %>% 
  select(date_onset, date_hospitalisation, everything()) %>% 
  names()
##  [1] "date_onset"           "date_hospitalisation" "case_id"              "generation"           "date_infection"      
##  [6] "date_outcome"         "outcome"              "gender"               "hospital"             "lon"                 
## [11] "lat"                  "infector"             "source"               "age"                  "age_unit"            
## [16] "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

Here are other helper functions that work within select():

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix
    • example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix
    • example: select(ends_with("_end"))
  • contains() - columns containing a character string
    • example: select(contains("time"))
  • matches() - to apply a regular expression (regex)
    • example: select(matches("[pt]al"))
  • num_range() - a numerical range like x01, x02, x03
  • any_of() - matches column names if they exist, but returns no error for names that are not found
    • example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))

In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
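For example, the : operator can select a span of consecutive columns, and ! negates a selection (a sketch using columns from linelist):

```r
# consecutive columns with ":"
linelist %>% 
  select(fever:vomit) %>%    # all columns from fever through vomit
  names()

# negation with "!"
linelist %>% 
  select(!contains("date")) %>%   # all columns EXCEPT those containing "date"
  names()
```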

Use where() to specify logical criteria for columns. If providing a function inside where(), do not include the empty parentheses. The command below selects columns that are class Numeric.

# select columns that are class Numeric
linelist %>% 
  select(where(is.numeric)) %>% 
  names()
## [1] "generation" "lon"        "lat"        "row_num"    "wt_kg"      "ht_cm"      "ct_blood"   "temp"

Use contains() to select only columns in which the column name contains a string. ends_with() and starts_with() provide more nuance.

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"

The function matches() works similarly to contains() but can be provided a regular expression (see page on Characters and strings), such as multiple strings separated by OR bars within the parentheses:

# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"             "fever"

CAUTION: If a column name that you specifically provide does not exist in the data, it can return an error and stop your code. Consider using any_of() to cite columns that may or may not exist, especially useful in negative (remove) selections.

Only one of these columns exists, but no error is produced and the code continues.

linelist %>% 
  select(any_of(c("date_onset", "village_origin", "village_detection", "village_residence", "village_travel"))) %>% 
  names()
## [1] "date_onset"

Remove columns

Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all cols from fever to vomit
  names()
##  [1] "case_id"              "generation"           "date_infection"       "date_hospitalisation" "date_outcome"        
##  [6] "outcome"              "gender"               "hospital"             "lon"                  "lat"                 
## [11] "infector"             "source"               "age"                  "age_unit"             "row_num"             
## [16] "wt_kg"                "ht_cm"                "ct_blood"             "temp"                 "time_admission"      
## [21] "merged_header"        "x28"

Standalone

select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add to the pipe chain

In the linelist_raw, there are a few columns we do not need: row_num, merged_header, and x28. We remove them with a select() command in the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################

    # remove column
    select(-c(row_num, merged_header, x28))

Deduplication

See the handbook page on De-duplication. Only a very simple de-duplication example is presented here.

The package dplyr offers the distinct() function to reduce the dataframe to only unique rows - removing rows that are 100% duplicates. We just add the simple command distinct() to the pipe chain:

We begin with 6479 rows in linelist.

linelist <- linelist %>% 
  distinct()

After de-duplication there are 6479 rows. Any removed would have been rows that were 100% duplicates of other rows.

Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################
    
    # de-duplicate
    distinct()

Column creation and transformation

The dplyr function mutate() is used to add a new column, or to modify an existing one.

Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation)

In Stata, this is similar to the command generate, but R’s mutate() can also be used to modify an existing column.

New columns

The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.

linelist <- linelist %>% 
  mutate(new_col = 10)

You can also reference values in other columns, to perform calculations. For example, below a new column bmi is created to hold the Body Mass Index (BMI) for each case, as calculated using the formula BMI = kg/m^2 with the columns ht_cm and wt_kg.

linelist <- linelist %>% 
  mutate(bmi = wt_kg / (ht_cm/100)^2)

If creating multiple new columns, separate each with a comma and new line. Below are examples of new columns, including pasting together values from other columns using str_glue() from the stringr package (see page on Characters and strings).

linelist <- linelist %>%                       
  mutate(
    new_var_dup    = case_id,             # new column = duplicate/copy another existing column
    new_var_static = 7,                   # new column = all values the same
    new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
    new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
    ) 

Scroll to the right to see the new columns that have been added (first 50 rows shown):

TIP: The function transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.
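A minimal sketch of transmute() - only the columns named inside it would remain in the result:

```r
# transmute() keeps ONLY the columns it creates or mentions
linelist %>% 
  transmute(case_id = case_id,                    # kept as-is
            bmi     = wt_kg / (ht_cm/100)^2) %>%  # newly calculated
  names()
```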

Convert column class

Often you will need to set the correct class for a column. There are ways to set column class during the import commands, but this is often cumbersome. See the section on object classes to learn more about converting the class of objects, including columns.

First, run some checks on important columns to see if they are the correct class:

Currently, the class of the “age” column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analyses, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

In this case, use mutate() to define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric:

linelist <- linelist %>% 
  mutate(age = as.numeric(age))

Examples of other converting functions:

# Examples of modifying class
linelist <- linelist %>% 
  mutate(date_var      = as.Date(date_var, format = "%m/%d/%Y"),  # format codes, not "MM/DD/YYYY" - see page on Dates for details  
         numeric_var   = as.numeric(numeric_var),
         character_var = as.character(character_var),
         factor_var    = as_factor(factor_var)        # see page on Factors for details  
         )

Dates can be especially difficult to convert - there are several ways, but be careful with what you are doing. Typically, the raw date values must all be in the same format for conversion to work correctly (e.g. “MM/DD/YYYY”, or “DD MM YYYY”), and the matching format codes (e.g. “%m/%d/%Y”) must be supplied to the format = argument. See the page on Working with dates for details. Especially after converting to class Date, check your data visually or with a cross-table to confirm that each value was converted correctly. For as.Date(), the format = argument is often a source of errors.

Grouped data

If your dataframe is already grouped (see page on Grouping data), mutate() may behave differently than if the dataframe is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will be based on only the grouped rows, not all the rows.

# age normalized to mean of ALL rows
linelist %>% 
  select(case_id, age, hospital) %>% 
  mutate(age_norm = age / mean(age, na.rm=T))

# age normalized to mean of hospital group
linelist %>% 
  select(case_id, age, hospital) %>% 
  group_by(hospital) %>% 
  mutate(age_norm = age / mean(age, na.rm=T))

Read more about using mutate on grouped dataframes in this tidyverse mutate documentation.

Transform multiple columns

Often, to write concise code, you want to apply the same transformation to multiple columns at once. This can be done with the across() function from the package dplyr (also contained within the tidyverse package).

across() can be used with any dplyr function, but commonly with mutate(), filter(), or summarise(). across() allows you to specify which columns you want a function to apply to. To specify the columns, you can name them individually, or use helper functions.

Here the transformation as.character() is applied to specific columns named within across(). Note that functions in across() are written without their parentheses ( ).

linelist <- linelist %>% 
  mutate(across(c(temp, ht_cm, wt_kg), as.character))

There are helpers available to assist you in specifying columns:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix
    • example: across(starts_with("date"))
  • ends_with() - matches to a specified suffix
    • example: across(ends_with("_end"))
  • contains() - columns containing a character string
    • example: across(contains("time"))
  • matches() - to apply a regular expression (regex)
    • example: across(matches("[pt]al"))
  • num_range() - a numerical range like x01, x02, x03
  • any_of() - matches column names if they exist; useful if a name might not exist
    • example: across(any_of(c("date_onset", "date_death", "cardiac_arrest")))

Here is an example of how one would change all columns to character class:

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Columns where the name contains the string “date” (note placement of commas and parentheses):

# change columns whose names contain "date" to character class
linelist <- linelist %>% 
  mutate(across(contains("date"), as.character))

Below, we want to convert the columns that are class POSIXct (a datetime class that includes timestamps) - in other words, where the function is.POSIXct() evaluates to TRUE. We then apply the function as.Date() to these columns to convert them to the simpler class Date.

linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

  • Note that within across() we also use the function where()
  • Note that is.POSIXct() is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R

Here are a few online resources on using across(): creator Hadley Wickham’s thoughts/rationale

coalesce()

This dplyr function finds the first non-missing value at each position.

Say you have two vectors/columns, one for village of detection and another for village of residence. You can use coalesce() to pick the first non-missing value for each position:

village_detection <- c("a", "b", NA,  NA)
village_residence <- c("a", "c", "a", "d")

village <- coalesce(village_detection, village_residence)
village    # print
## [1] "a" "b" "a" "d"

This works the same if you provide data frame columns: for each row, the function will assign the new column value with the first non-missing value in the columns you provided (in order provided).

linelist <- linelist %>% 
  mutate(village = coalesce(village_detection, village_residence))

For more complicated row-wise calculations, see the section below on Row-wise calculations.

Cumulative math

If you want a column to reflect the cumulative sum/mean/min/max etc as assessed down the rows of a dataframe, use the following functions:

cumsum() returns the cumulative sum, as shown below:

sum(c(2,4,15,10))     # returns only one number
## [1] 31
cumsum(c(2,4,15,10))  # returns the cumulative sum at each step
## [1]  2  6 21 31

This can be used in a dataframe when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this:

cumulative_case_counts <- linelist %>% 
  count(date_onset) %>%                 # count of rows per day   
  mutate(cumulative_cases = cumsum(n))  # new column of the cumulative sum at that row

Below are the first 10 rows:

head(cumulative_case_counts, 10)
##    date_onset n cumulative_cases
## 1  2012-04-21 1                1
## 2  2012-04-27 1                2
## 3  2012-05-17 1                3
## 4  2012-05-18 1                4
## 5  2012-05-31 1                5
## 6  2012-06-16 1                6
## 7  2012-06-30 1                7
## 8  2012-07-03 1                8
## 9  2012-07-04 1                9
## 10 2012-07-12 1               10

See the page on Epidemic curves for how to plot cumulative incidence with the epicurve.

See also:
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
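The other cumulative functions behave analogously to cumsum(). A brief sketch (note that cummean(), cumany() and cumall() are from dplyr, while the others are base R):

```r
x <- c(2, 4, 15, 10)

cummax(x)           # running maximum: 2 4 15 15
dplyr::cummean(x)   # running mean:    2.00 3.00 7.00 7.75
```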

Using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add to pipe chain

Below, a new column is added to the pipe chain and some classes are converted.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    # add new column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>% 
  
    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) 

Re-code values

Here are a few scenarios where you need to re-code (change) values:

  • to edit one specific value (e.g. one date with an incorrect year or format)
  • to reconcile values not spelled the same
  • to create a new column of categories
  • to create a new column of numeric categories (e.g. age categories)

Specific values

To change values manually you can use the recode() function within the mutate() function.

Imagine there is a nonsensical date in the data (e.g. “2014-14-15”): you could fix the date in the source data, or, you could write the change into the cleaning pipeline via mutate() and recode().

# fix incorrect values                   # old value       # new value
linelist <- linelist %>% 
  mutate(date_onset = recode(date_onset, "2014-14-15" = "2014-04-15"))

The mutate() line above can be read as: “mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this.

Here is another example re-coding multiple values within one column.

In linelist the values in the column “hospital” must be cleaned. There are several different spellings and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                      Central Hopital                     Central Hospital                           Hospital A 
##                                   11                                  443                                  289 
##                           Hospital B                     Military Hopital                    Military Hospital 
##                                  289                                   30                                  786 
##                     Mitylira Hopital                    Mitylira Hospital                                Other 
##                                    1                                   79                                  885 
##                         Port Hopital                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                   47                                 1725                                  411 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   11                                 1472

The recode() command below re-defines the column “hospital” as the current column “hospital”, but with the specified recode changes. Don’t forget commas after each!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      #    reference: OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see the spellings in the hospital column have been corrected and consolidated:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A                           Hospital B 
##                                  454                                  289                                  289 
##                    Military Hospital                                Other                        Port Hospital 
##                                  896                                  885                                 1772 
## St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  422                                 1472

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
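For example, a sketch converting such blank values in the hospital column to NA (using na_if(), a dplyr function for converting specific values to NA):

```r
# sketch: convert blank character values "" to NA
linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, ""))
```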

Missing values

See the page on Missing data for more detailed tips on handling missing values. dplyr offers two special functions for handling missing values:

replace_na()

To change missing values (NA) to a specific value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode() above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

na_if()

To convert a specific value to NA, use na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:

# Convert temperatures above 40 to NA 
linelist <- linelist %>% 
  mutate(temp = replace(temp, temp > 40, NA))

# Convert onset dates earlier than 2000 to missing
linelist <- linelist %>% 
  mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))

By logic

Below we demonstrate how to re-code values in a column using logic and conditions:

  • Using replace(), ifelse() and if_else() for simple logic
  • Using case_when() for more complex logic

Simple logic

replace()

To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logical condition to specify the rows to change. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).

One common situation is changing just one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.

# Example: change gender of one specific observation to "Female" 
linelist <- linelist %>% 
  mutate(gender = replace(gender, case_id == "2195", "Female"))

The equivalent command using base R syntax and the indexing brackets [ ] is below. It reads as: “change the value of the dataframe linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.

linelist$gender[linelist$case_id == "2195"] <- "Female"

ifelse() and if_else()

Another tool for simple logical re-coding is ifelse() and its partner if_else(). However, in most cases it is better to use case_when() (for clarity).

These commands are simplified versions of an if and else programming statement. The general syntax is:
ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)

Below, the column source_known is defined (or re-defined). Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

if_else() is a special version from dplyr that handles dates and is stricter about classes. Note that if the ‘true’ value is a date, the ‘false’ value must also qualify as a date, hence using the special value NA_real_ instead of just NA.

# Create a date of death column, which is NA if patient has not died.
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.

Outside of the context of a data frame, if you want to have an object used in your code switch its value, consider using switch() from base R. See the section on using switch() in the page on having an [Interactive console].
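As a minimal sketch (the object names and values here are hypothetical), switch() returns the value matched to a character key:

```r
# Minimal sketch of switch(): return a value based on a character key
# (object names and values are hypothetical, for illustration only)
time_scale <- "monthly"

periods_per_year <- switch(time_scale,
                           "weekly"  = 52,
                           "monthly" = 12,
                           "daily"   = 365,
                           NA)   # final unnamed value is the default if no key matches

periods_per_year
```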

Complex logic

Use dplyr’s case_when() if you need to use complex logic statements to re-code values. There are important differences from recode() in syntax and logic order!

case_when() commands have a Left-Hand Side (LHS) and a Right-Hand Side (RHS) separated by a “tilde” ~. The logic criteria go on the LHS and the resulting value on the RHS. Statements are separated by commas. It is important to note that:

  • Statements are evaluated in the order written - from top-to-bottom. Thus it is best to write the most specific criteria first, and the most general last.
  • End with TRUE on the LHS, which matches any row that did not meet any of the previous criteria
  • The values on the RHS must all be the same class - either numeric, character, logical, etc.
    • To assign NA, you may need to use special values such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA)

Below we utilize the columns age and age_unit to create a column age_years:

linelist <- linelist %>% 
  mutate(age_years = case_when(
            age_unit == "years"  ~ age,       # if age is given in years
            age_unit == "months" ~ age/12,    # if age is given in months
            is.na(age_unit)      ~ age,       # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # any other circumstance assign missing

Cleaning dictionary

Use the package linelist to clean a linelist with a cleaning dictionary.

  1. Import a cleaning dictionary with 3 columns:
    • A “from” column (the incorrect value)
    • A “to” column (the correct value)
    • A column specifying the column for the changes to be applied (or “.global” to apply to all columns)

cleaning_dict <- import("cleaning_dict.csv")

  2. Store names of any columns that you want to “protect” from the changes. They must be provided to clean_data() as a numeric or logical vector, so you will see use of names(.) in the command below (the dot means the dataframe).

protected_cols <- c("case_id", "source")

  3. Run clean_data(), specifying the cleaning dictionary:

linelist <- linelist %>% 
  linelist::clean_data(
    wordlists = cleaning_dict,
    spelling_vars = "col",       # dict column containing column names, defaults to 3rd column in dict
    protect = names(.) %in% protected_cols
  )

Scroll to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.

CAUTION: clean_data() from the linelist package will also clean the values in your data unless those columns are protected - you may encounter unintended changes, for example to values containing dashes (“-”) or other special characters.

Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. clean_data() itself also implements a column name cleaning function similar to clean_names() from janitor that standardizes column names prior to applying the dictionary.

See this online reference for the linelist package for more details.

Add to pipe chain

Below, some new columns and column transformations are added to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
   # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
   ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

Numeric categories

Here we describe some special approaches for creating numeric categories. Common examples include age categories, groups of lab values, etc. Here we will discuss:

  • age_categories(), from the epikit package
  • cut(), from base R
  • case_when()
  • quantile breaks

Review distribution

For this example we will create an age_cat column using the age_years column.

# check the class of the column age_years
class(linelist$age_years)
## [1] "numeric"

First, examine the distribution of your data, to make appropriate cut-points. See the page on how to Plot continuous data.

# examine the distribution
hist(linelist$age_years)

summary(linelist$age_years, na.rm=T)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.16   23.00   90.00     100

CAUTION: Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimal place (e.g. “4,5” to mean four and one half years).

age_categories()

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). Of note: the output is an ordered factor.

Here are the required inputs:

  • A numeric vector (column)
  • The breakers = - a numeric vector of break points for the new groups

First, the most simple example:

# Simple example
################
pacman::p_load(epikit)

linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years,
      breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-69   70+  <NA> 
##  1170  1212  1009   855  1207   563   236    92    21    14   100

The break values you specify are by default included in the “higher” group - groups are “open” on the lower/left side. As shown below, you can add 1 to each break value to achieve groups that are open at the top/right.

# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-40 41-50 51-60 61-70   71+  <NA> 
##  1430  1183  1002   796  1127   524   200    85    18    14   100

You can adjust how the labels are displayed with separator =. The default is “-”.
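For example, a minimal sketch (with a made-up age vector) changing the separator to “ to ”:

```r
# Hypothetical sketch: use " to " instead of "-" in the category labels
pacman::p_load(epikit)

ages <- c(2, 12, 27, 65)

age_categories(ages,
               breakers  = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
               separator = " to ")   # labels now read e.g. "0 to 4" rather than "0-4"
```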

You can adjust the upper cut-off of values allowed to be included in a group with ceiling =; the default is FALSE. If TRUE, the highest break value acts as a “ceiling” and no “XX+” category is created. Any values above the highest break value (or above upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned NA.

# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
      ceiling = TRUE)) # 70 is ceiling, all above become NA

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-70  <NA> 
##  1170  1212  1009   855  1207   563   236    92    21   114

Alternatively, instead of breakers =, you can provide all of lower =, upper =, and by =:

  • lower = The lowest number you want considered - default is 0
  • upper = The highest number you want considered
  • by = The number of years between groups

linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      lower = 0,
      upper = 100,
      by = 10))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99  100+  <NA> 
##  2382  1864  1207   563   236    92    21    10     2     2     0   100

See the function’s Help page for more details (enter ?age_categories in the R console).

cut()

You can also use the base R function cut(), which creates categories from a numeric column. The differences from age_categories() are:

  • You do not need to install/load another package
  • You can specify whether groups are open/closed on the right/left
  • You must provide accurate labels yourself
  • If you want 0 included in the lowest group you must specify this

The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is an ordered factor. If used within mutate() (a dplyr verb) it is not necessary to specify the dataframe before the column name (e.g. linelist$age_years).

Create new column of age categories (age_cat) by cutting the numeric age_year column at specified break points.

  • Specify numeric vector of break points
  • Default behavior for cut() is that lower break values are excluded from each category, and upper break values are included. This is the opposite behavior from the age_categories() function.
  • Include 0 in the lowest category by adding include.lowest = TRUE
  • Add a vector of customized labels using the labels = argument
  • Check your work with cross-tabulation of the numeric and category columns - be aware of missing values

Below is a detailed description of the behavior of using cut() to make the age_cat column. Key points:

  • Inclusion/exclusion behavior of break points
  • Custom category labels
  • Handling missing values
  • Check your work!

A simple example of cut() applied to age_years to make the new variable age_cat is below:

# Create new variable, by cutting the numeric age variable
# by default, upper break is included and lower break excluded from each category
linelist <- linelist %>% 
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),
      include.lowest = TRUE         # include 0 in lowest group
      ))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    [0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100]     <NA> 
##     1430     1183     1002      796     1127      724      103       14      100

  • By default, the categorization occurs so that the right/upper side is “closed” (inclusive) and the left/lower side is “open” (exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but includes B (the upper break value). Reverse this behavior by providing the right = FALSE argument.

  • Thus, by default “0” values are excluded from the lowest group, and categorized as NA. “0” values could be infants coded as age 0. To change this add the argument include.lowest = TRUE. Then, any “0” values are included in the lowest group. The automatically-generated label for the lowest category will change from “(0,B]” to “[0,B]”, which signifies that 0 values are included.

  • Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 15-20).

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
##   0                    132      0       0       0       0       0       0        0    0
##   0.0833333333333333     2      0       0       0       0       0       0        0    0
##   0.166666666666667      1      0       0       0       0       0       0        0    0
##   0.25                   1      0       0       0       0       0       0        0    0
##   0.333333333333333      1      0       0       0       0       0       0        0    0
##   0.416666666666667      5      0       0       0       0       0       0        0    0
##   0.5                    5      0       0       0       0       0       0        0    0
##   0.583333333333333      2      0       0       0       0       0       0        0    0
##   0.666666666666667      1      0       0       0       0       0       0        0    0
##   0.75                   1      0       0       0       0       0       0        0    0
##   0.833333333333333      2      0       0       0       0       0       0        0    0
##   1                    248      0       0       0       0       0       0        0    0
##   1.5                    4      0       0       0       0       0       0        0    0
##   2                    261      0       0       0       0       0       0        0    0
##   3                    247      0       0       0       0       0       0        0    0
##   4                    257      0       0       0       0       0       0        0    0
##   5                    260      0       0       0       0       0       0        0    0
##   6                      0    251       0       0       0       0       0        0    0
##   7                      0    217       0       0       0       0       0        0    0
##   8                      0    238       0       0       0       0       0        0    0
##   9                      0    246       0       0       0       0       0        0    0
##   10                     0    231       0       0       0       0       0        0    0
##   11                     0      0     223       0       0       0       0        0    0
##   12                     0      0     169       0       0       0       0        0    0
##   13                     0      0     210       0       0       0       0        0    0
##   14                     0      0     176       0       0       0       0        0    0
##   15                     0      0     224       0       0       0       0        0    0
##   16                     0      0       0     148       0       0       0        0    0
##   17                     0      0       0     146       0       0       0        0    0
##   18                     0      0       0     185       0       0       0        0    0
##   19                     0      0       0     152       0       0       0        0    0
##   20                     0      0       0     165       0       0       0        0    0
##   21                     0      0       0       0     145       0       0        0    0
##   22                     0      0       0       0     156       0       0        0    0
##   23                     0      0       0       0     122       0       0        0    0
##   24                     0      0       0       0     127       0       0        0    0
##   25                     0      0       0       0     118       0       0        0    0
##   26                     0      0       0       0      92       0       0        0    0
##   27                     0      0       0       0     107       0       0        0    0
##   28                     0      0       0       0      79       0       0        0    0
##   29                     0      0       0       0      96       0       0        0    0
##   30                     0      0       0       0      85       0       0        0    0
##   31                     0      0       0       0       0      62       0        0    0
##   32                     0      0       0       0       0      58       0        0    0
##   33                     0      0       0       0       0      68       0        0    0
##   34                     0      0       0       0       0      58       0        0    0
##   35                     0      0       0       0       0      49       0        0    0
##   36                     0      0       0       0       0      48       0        0    0
##   37                     0      0       0       0       0      48       0        0    0
##   38                     0      0       0       0       0      40       0        0    0
##   39                     0      0       0       0       0      47       0        0    0
##   40                     0      0       0       0       0      46       0        0    0
##   41                     0      0       0       0       0      24       0        0    0
##   42                     0      0       0       0       0      24       0        0    0
##   43                     0      0       0       0       0      30       0        0    0
##   44                     0      0       0       0       0      20       0        0    0
##   45                     0      0       0       0       0      18       0        0    0
##   46                     0      0       0       0       0      21       0        0    0
##   47                     0      0       0       0       0      25       0        0    0
##   48                     0      0       0       0       0      17       0        0    0
##   49                     0      0       0       0       0      11       0        0    0
##   50                     0      0       0       0       0      10       0        0    0
##   51                     0      0       0       0       0       0      11        0    0
##   52                     0      0       0       0       0       0       9        0    0
##   53                     0      0       0       0       0       0      15        0    0
##   54                     0      0       0       0       0       0       8        0    0
##   55                     0      0       0       0       0       0      10        0    0
##   56                     0      0       0       0       0       0       4        0    0
##   57                     0      0       0       0       0       0       4        0    0
##   58                     0      0       0       0       0       0       7        0    0
##   59                     0      0       0       0       0       0      14        0    0
##   60                     0      0       0       0       0       0       3        0    0
##   61                     0      0       0       0       0       0       2        0    0
##   62                     0      0       0       0       0       0       3        0    0
##   63                     0      0       0       0       0       0       4        0    0
##   64                     0      0       0       0       0       0       3        0    0
##   65                     0      0       0       0       0       0       1        0    0
##   66                     0      0       0       0       0       0       2        0    0
##   67                     0      0       0       0       0       0       1        0    0
##   68                     0      0       0       0       0       0       2        0    0
##   71                     0      0       0       0       0       0       0        2    0
##   72                     0      0       0       0       0       0       0        1    0
##   73                     0      0       0       0       0       0       0        1    0
##   74                     0      0       0       0       0       0       0        1    0
##   75                     0      0       0       0       0       0       0        2    0
##   76                     0      0       0       0       0       0       0        2    0
##   78                     0      0       0       0       0       0       0        1    0
##   83                     0      0       0       0       0       0       0        1    0
##   87                     0      0       0       0       0       0       0        1    0
##   90                     0      0       0       0       0       0       0        2    0
##   <NA>                   0      0       0       0       0       0       0        0  100

Reverse break inclusion behavior in cut()

Lower break values will be included in each category (and upper break values excluded) if the argument right = is set to FALSE. This is applied below - note how the values have shifted among the categories.

NOTE: If you include the include.lowest = TRUE argument along with right = FALSE, the extreme inclusion will apply to the highest break point value and category, not the lowest.

linelist <- linelist %>% 
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),  # same breaks as above
      right = FALSE,                # include each *lower* break point
      include.lowest = TRUE         # include *highest* value in *highest* group
      ))                                                 

table(linelist$age_cat, useNA = "always")
## 
##    [0,5)   [5,10)  [10,15)  [15,20)  [20,30)  [30,50)  [50,70) [70,100]     <NA> 
##     1170     1212     1009      855     1207      799      113       14      100

Add labels

As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described above. Below is the same code as above, with manual labels added.

linelist <- linelist %>% 
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),  # same breaks as above
      right = FALSE,                # include each *lower* break point
      include.lowest = TRUE,        # include *highest* value in *highest* group
      labels = c("0-4", "5-9", "10-14",
                 "15-19", "20-29", "30-49",
                 "50-69", "70-100")
      ))

table(linelist$age_cat, useNA = "always")
## 
##    0-4    5-9  10-14  15-19  20-29  30-49  50-69 70-100   <NA> 
##   1170   1212   1009    855   1207    799    113     14    100

Re-labeling NA values with cut()

Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps because cut() automatically classifies the new column age_cat as class Factor (a rigid class limited to the defined values).

First, convert age_cat from Factor to Character class, so you have flexibility to add new character values (e.g. “Missing”). Otherwise you will encounter an error. Then, use replace_na() (from tidyr) to replace NA values with a character value like “Missing”. These steps can be combined into one, as shown below.

Note that Missing has been added, but the order of the categories is now wrong (alphabetical considering numbers as characters).

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,
                          include.lowest = TRUE,        
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"))


table(linelist$age_cat, useNA = "always")
## 
##     0-4   10-14   15-19   20-29   30-49     5-9   50-69  70-100 Missing    <NA> 
##    1170    1009     855    1207     799    1212     113      14     100       0

To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,
                          include.lowest = TRUE,        
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"),
         
         # re-classify age_cat as Factor, with correct level order and new "Missing" level
         age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
                                              "30-49", "50-69", "70-100", "Missing")))    
  

table(linelist$age_cat, useNA = "always")
## 
##     0-4     5-9   10-14   15-19   20-29   30-49   50-69  70-100 Missing    <NA> 
##    1170    1212    1009     855    1207     799     113      14     100       0

If the above seems cumbersome, consider using age_categories() instead, as described before.

Make breaks and labels

For a fast way to make breaks and labels manually, use something like below. See the R Basics page for references on seq() and rep(). Note that cut() requires one fewer label than break points, so the labels are built from consecutive pairs of break values.

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
# (one label per interval: "1-5", "6-10", ... "86-90")
age_labels = paste0(head(age_seq, -1) + 1, "-", age_seq[-1])
age_labels

# check that there is one fewer label than break points
length(age_seq) - 1 == length(age_labels)

Read more about cut() in its Help page by entering ?cut in the R console.

Quantile breaks

Make breaks from quantile(). This function is from the stats package, which comes with base R.

age_quantiles <- quantile(linelist$age_years, c(0, .25, .50, .75, .90, .95), na.rm=T)
age_quantiles
##  0% 25% 50% 75% 90% 95% 
##   0   6  13  23  34  40
# to return only the numbers use unname()
age_quantiles <- unname(age_quantiles)
age_quantiles
## [1]  0  6 13 23 34 40

You can then use these as break points in age_categories() or cut().
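For instance, a minimal sketch (with a made-up age vector) passing quantile-derived breaks to cut():

```r
# Minimal sketch with made-up ages: quantile-based categories via cut()
ages <- c(0, 3, 8, 15, 22, 30, 41, 67)

# quartile break points, unname() to keep only the numbers
breaks <- unname(quantile(ages, probs = c(0, .25, .50, .75, 1), na.rm = TRUE))

# include.lowest = TRUE so the minimum value is not dropped;
# unique() guards against duplicate break values
cut(ages, breaks = unique(breaks), include.lowest = TRUE)
```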

case_when()

The dplyr function case_when() can also be used to create numeric categories.

  • Allows explicit setting of break point inclusion/exclusion
  • Allows designation of label for NA values in one step
  • More complicated code
  • Allows more flexibility to include other variables in the logic

If using case_when(), please review the proper use as described earlier in this page, as the logic and order of assignment are important to understand in order to avoid errors.

CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (either “Missing”, or the special NA_character_ instead of NA).

You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the conversion to factor and the ordering of levels are done automatically.

linelist <- linelist %>% 
  mutate(
    age_cat = factor(case_when(
      # provide the case_when logic and outcomes
      age_years >= 0 & age_years < 5     ~ "0-4",          
      age_years >= 5 & age_years < 10    ~ "5-9",
      age_years >= 10 & age_years < 15   ~ "10-14",
      age_years >= 15 & age_years < 20   ~ "15-19",
      age_years >= 20 & age_years < 30   ~ "20-29",
      age_years >= 30 & age_years < 50   ~ "30-49",
      age_years >= 50 & age_years < 70   ~ "50-69",
      age_years >= 70 & age_years <= 100 ~ "70-100",
      is.na(age_years)                   ~ "Missing",      # if age_years is missing
      TRUE                               ~ "Check value"), # trigger for review
      
      # define the levels order for factor()
      levels = c("0-4","5-9", "10-14",
                 "15-19", "20-29", "30-49",
                 "50-69", "70-100", "Missing", "Check value")))

And now view the results with a table of the new column:

table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49       50-69      70-100     Missing Check value        <NA> 
##        1170        1212        1009         855        1207         799         113          14         100           0           0

Add to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove columns
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

Add rows

Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666,
          case_id = "abc",
          generation = 4,
          `infection date` = as.Date("2020-10-10"),
          .before = 2)

Use .before and .after to place the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).

The displayed row number of the new row may look strange (“…23”), but note that the row numbers of the pre-existing rows have also changed. So if using the command twice, examine/test each insertion carefully.

If the class of a value you provide does not match the column's class, you will see an error like this:

Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.

(when inserting a row with a date value, remember to wrap the date in the function as.Date() like as.Date("2020-10-10")).

Filter rows

A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

Below is shown how to filter rows based on simple and complex logical conditions, and how to filter/subset rows both as a stand-alone command and with base R.

Simple filter()

This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.

In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6479.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6474.

Complex filter()

A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

How filters handle missing numeric and date values

Can we just filter by date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number will also remove any rows with missing values (NA)! This is because a comparison involving NA evaluates to NA, and filter() keeps only rows where the condition evaluates to TRUE.
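A quick base R illustration of this behaviour, using made-up dates:

```r
# a vector of dates, one of which is missing
dates <- as.Date(c("2013-01-15", "2014-02-01", NA))

# a comparison involving NA returns NA (not TRUE/FALSE), and
# filter() keeps only rows where the condition is TRUE - so the NA row is dropped
dates > as.Date("2013-06-01")
## [1] FALSE  TRUE    NA

# to also keep rows with missing dates, test for them explicitly
dates > as.Date("2013-06-01") | is.na(dates)
## [1] FALSE  TRUE  TRUE
```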

(See the page on Working with dates for more information on working with dates and the package lubridate)

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  359   95    0
##   Hospital A                            248   40    0    0    0
##   Hospital B                            246   42    0    0    0
##   Military Hospital                       0    0  698  198    0
##   Missing                                 0    0 1159  310    0
##   Other                                   0    0  713  172    0
##   Port Hospital                          10    0 1423  339    0
##   St. Mark's Maternity Hospital (SMMH)    0    0  331   91    0
##   <NA>                                    0    0    0    0    0

What other criteria can we filter on to remove the first outbreak (in 2012 & 2013) from the dataset? We see that:

  • The first epidemic in 2012 & 2013 occurred at Hospital A and Hospital B, and there were also 10 cases at Port Hospital.
  • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude:

  • The 586 rows with onset in 2012 and 2013, at Hospital A, Hospital B, or Port Hospital
  • Any rows from Hospitals A & B with missing onset dates (per the table above, there are 0 of these)

We do not want to exclude rows with missing onset dates from other hospitals (also 0 in this dataset), as those could belong to the second outbreak.

We start with a linelist of 6,474 rows. Here is our filter statement:

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 5888

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      359   95    0
##   Military Hospital                     698  198    0
##   Missing                              1159  310    0
##   Other                                 713  172    0
##   Port Hospital                        1423  339    0
##   St. Mark's Maternity Hospital (SMMH)  331   91    0
##   <NA>                                    0    0    0

Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.

Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete with no missing values. This is true. But date_onset is used for purposes of demonstrating a complex filter.

Standalone

Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

Quickly review records

This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [ ] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View the linelist in the viewer panel:

View(linelist)

View specific data for three cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Note: the above command can also be written with dplyr verbs filter() and select() as below:

View(linelist %>%
       filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
       select(date_onset, date_hospitalisation))

Add to pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove columns
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

Row-wise calculations

If you want to perform a calculation within a row, you can use rowwise() from dplyr. See the dplyr vignette on row-wise calculations.

For example, this code applies rowwise() and then creates a new column that sums the number of symptoms per case:

linelist <- linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>%
  ungroup()   # return to a regular (non-rowwise) dataframe

Working with dates

Overview

Working with dates in R is notoriously difficult when compared to other object classes. R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many date formats, some of which can be confused for other formats. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages.

Dates in R are their own class of object - the Date class. Note that there is also a class that stores objects with date and time. Datetime objects are formally referred to as the POSIXt, POSIXct, and/or POSIXlt classes (the difference isn’t important here). These objects are informally referred to as datetime classes.
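To illustrate the distinction between the two classes (the values here are arbitrary):

```r
d  <- as.Date("2021-03-10")                          # a Date object
dt <- as.POSIXct("2021-03-10 09:35:58", tz = "UTC")  # a datetime object

class(d)
## [1] "Date"
class(dt)
## [1] "POSIXct" "POSIXt"
```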

You can get the system date or system datetime by doing the following:

# get the system date - this is a DATE class
Sys.Date()
## [1] "2021-03-10"
# get the system time - this is a DATETIME class
Sys.time()
## [1] "2021-03-10 09:35:58 EST"

  • It is important to make R recognize when a column contains dates.
  • Dates are an object class and can be tricky to work with.
  • Here we present several ways to convert date columns to Date class.

Packages

The following packages are recommended for working with dates:

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(aweek,      # flexibly converts dates to weeks, and vice-versa
               lubridate,  # for conversions to months, years, etc.
               linelist,   # function to guess messy dates
               ISOweek,    # another option for creating weeks
               tidyverse)  # data management and visualization

Convert to Date class

as.Date()

The standard, base R function to convert an object or column to class Date is as.Date() (note capitalization).

as.Date() requires that the user specify the existing format of the date, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().

If used on a column, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them manually or consider using guess_dates() from the linelist package.

It can be easiest to first convert the column to character class, and then convert to date class:

  1. Turn the column into character values using the function as.character()
# With pipes
linelist <- linelist %>% 
  mutate(date_onset = as.character(date_onset))

# In base R
linelist$date_onset <- as.character(linelist$date_onset)
  2. Convert the column from character values into date values, using the function as.Date()
    (note the capital “D”)
  • Within the as.Date() function, you must use the format= argument to tell R the current format of the date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.

    • The codes are:
      %d = Day # (of the month e.g. 16, 17, 18…)
      %a = abbreviated weekday (Mon, Tues, Wed, etc.)
      %A = full weekday (Monday, Tuesday, etc.)
      %m = # of month (e.g. 01, 02, 03, 04)
      %b = abbreviated month (Jan, Feb, etc.)
      %B = Full Month (January, February, etc.)
      %y = 2-digit year (e.g. 89)
      %Y = 4-digit year (e.g. 1989)

For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.

# Using pipes
linelist <- linelist %>% 
  mutate(date_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))

# Using base R
linelist$date_onset <- as.Date(linelist$date_of_onset, format = "%d/%m/%Y")

TIP: The format = argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.

TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
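For example, supplying the wrong separator silently returns NA rather than an error (the date below is hypothetical):

```r
# separator in format = does not match the data - returns NA
as.Date("24-04-1968", format = "%d/%m/%Y")
## [1] NA

# separator matches the data - parses correctly
as.Date("24-04-1968", format = "%d-%m-%Y")
## [1] "1968-04-24"
```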

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.

lubridate

Converting character objects to dates can be made far easier by using the lubridate package. This is a tidyverse package designed to make working with dates and times simpler and more consistent than in base R. For these reasons, lubridate is often considered the gold-standard package for dates and times, and is recommended whenever working with them.

The lubridate package provides several different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.

# install/load lubridate 
pacman::p_load(lubridate)

The ymd() function flexibly converts date values supplied as year, then month, then day.

# read date in year-month-day format
ymd("2020-10-11")
## [1] "2020-10-11"
ymd("20201011")
## [1] "2020-10-11"

The mdy() function flexibly converts date values supplied as month, then day, then year.

# read date in month-day-year format
mdy("10/11/2020")
## [1] "2020-10-11"
mdy("Oct 11 20")
## [1] "2020-10-11"

The dmy() function flexibly converts date values supplied as day, then month, then year.

# read date in day-month-year format
dmy("11 10 2020")
## [1] "2020-10-11"
dmy("11 October 2020")
## [1] "2020-10-11"

If using piping and the tidyverse, converting a character column to dates with lubridate might look like this:

linelist <- linelist %>%
  mutate(date_onset = lubridate::dmy(date_onset))

Once complete, you can run a command to verify the class of the column:

# Check the class of the column
class(linelist$date_onset)  

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.

Convert to datetime classes

As previously mentioned, R also supports a datetime class - a column that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.

A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied.

Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are extensions of the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:

Convert datetime with only hours to datetime object

ymd_h("2020-01-01 16hrs")
## [1] "2020-01-01 16:00:00 UTC"
ymd_h("2020-01-01 4PM")
## [1] "2020-01-01 16:00:00 UTC"

Convert datetime with hours and minutes to datetime object

mdy_hm("Jan 1st 2020 16:20")
## [1] "2020-01-01 16:20:00 UTC"

Convert datetime with hours, minutes, and seconds to datetime object

dmy_hms("01 January 20, 16:20:40")
## [1] "2020-01-01 16:20:40 UTC"

You can supply a time zone, but it is ignored at parsing (the result is returned in UTC unless tz = is specified). See the section later in this page on time zones.

dmy_hms("01 January 20, 16:20:40 PST")

When working with a dataframe, time and date columns can be combined to create a datetime column using str_glue() from stringr package and an appropriate lubridate function:

# packages
pacman::p_load(tidyverse, lubridate, stringr)

# time_admission is a column in hours:minutes
linelist <- linelist %>%
  
  # when time of admission is not given, assign the median admission time
  mutate(
    time_admission_clean = ifelse(
      is.na(time_admission),                 # if time is missing
      median(time_admission, na.rm = TRUE),  # assign the median time
      time_admission)) %>%                   # otherwise keep the recorded time
  
  # use str_glue() to combine the date and time columns into one character column
  # and then use ymd_hm() to convert it to datetime
  mutate(
    date_time_of_admission = str_glue("{date_hospitalisation} {time_admission_clean}") %>% 
      ymd_hm())

Working with dates

lubridate can also be used for a variety of other functions, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals.

Here we define a date to use for the examples:

# create object of class Date
example_date <- ymd("2020-03-01")
# extract the month and day from this date
month(example_date)  # month number
## [1] 3
day(example_date)    # day (number)
## [1] 1
wday(example_date)   # day number of the week (1-7)
## [1] 1

You can also extract time components from a datetime object or column. This can be useful if you want to view the distribution of admission times.

example_datetime <- ymd_hm("2020-03-01 14:45")

hour(example_datetime)     # extract hour
minute(example_datetime)   # extract minute
second(example_datetime)   # extract second

You can retrieve epiweeks with epiweek() from lubridate:

# get the epiweek of this date (this will be expanded later)
epiweek(example_date)
## [1] 10

Date math

# add 3 days to this date
example_date + days(3)
## [1] "2020-03-04"
# add 7 weeks and subtract two days from this date
example_date + weeks(7) - days(2)
## [1] "2020-04-17"

Date intervals

# find the interval between this date and Feb 20 2020 
example_date - ymd("2020-02-20")
## Time difference of 10 days

This can all be brought together to work with data - for example:

pacman::p_load(lubridate, tidyverse)   # load packages

linelist <- linelist %>%
  
  # convert date of onset from character to date objects by specifying dmy format
  mutate(date_onset = dmy(date_onset),
         date_hospitalisation = dmy(date_hospitalisation)) %>%
  
  # keep only rows with onset in March
  filter(month(date_onset) == 3) %>%
    
  # find the difference in days between onset and hospitalisation
  mutate(onset_to_hosp_days = date_hospitalisation - date_onset)

Messy dates

The function guess_dates() from the linelist package attempts to read a “messy” date column containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(). If guess_dates() is not yet available on CRAN for R 4.0.2, try installing it via pacman::p_load_gh("reconhub/linelist").

For example, guess_dates() would see a vector of the following character dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.

linelist::guess_dates(c("03 Jan 2018",
                        "07/03/1982",
                        "08/20/85"))
## [1] "2018-01-03" "1982-03-07" "1985-08-20"

Some optional arguments for guess_dates() that you might include are:

  • error_tolerance - The proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
  • last_date - the last valid date (defaults to current date)
  • first_date - the first valid date. Defaults to fifty years before the last_date.
# An example using guess_dates() on the column date_onset
linelist <- linelist %>%                 # the dataset is called linelist
  mutate(
    date_onset = linelist::guess_dates(  # guess_dates() from the package "linelist"
      date_onset,
      error_tolerance = 0.1,
      first_date = "2016-01-01"))

Excel Dates

Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”, use as.Date() or lubridate’s as_date() to convert - but instead of supplying a format as above, supply an origin date. This will not work if the Excel date is stored as character class, so be sure the values are numeric (or convert them, e.g. with as.numeric())!

NOTE: You should provide the origin date in R’s default date format (“YYYY-MM-DD”).

library(lubridate)
library(dplyr)

# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>% 
  mutate(date_onset = as_date(as.double(date_onset), origin = "1899-12-30")) # convert to numeric, then convert to date

Date display

Once dates are the correct class, you often want them to display differently (e.g. in a plot, graph, or table). For example, to display as “Monday 05 Jan” instead of 2018-01-05. You can do this with the base R function format(), which works in a similar way as as.Date(). Read more in this online tutorial.

Remember that the output from format() is of character class, so it is generally used for display purposes only!
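A quick demonstration that format() returns character class (using an arbitrary fixed date; the exact weekday/month text depends on your system locale):

```r
x <- format(as.Date("2018-01-05"), format = "%A %d %b")
x          # e.g. "Friday 05 Jan" in an English locale

class(x)   # character - for display only, not for date calculations
## [1] "character"
```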

%d = Day # (of the month e.g. 16, 17, 18…)
%a = abbreviated weekday (Mon, Tues, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = # of month (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = Full Month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = time zone (character)

An example of formatting today’s date:

# today's date, with formatting
format(Sys.Date(), format = "%d %B %Y")
## [1] "10 March 2021"
# easy way to get full date and time (no formatting)
date()
## [1] "Wed Mar 10 09:36:00 2021"
# formatted date, time, and time zone (using paste0() function)
paste0(
  format(Sys.Date(), format = "%A, %b %d '%y, %z  %Z, "), 
  format(Sys.time(), format = "%H:%M:%S")
)
## [1] "Wednesday, Mar 10 '21, +0000  UTC, 09:36:00"

Calculate distance between dates

The difference between dates can be calculated by:

  1. Correctly formatting both date columns as class Date (see instructions above)
  2. Creating a new column defined as one date column subtracted from the other
  3. Converting the result to numeric class (by default it is class “difftime”). This ensures that subsequent mathematical calculations can be performed.
# define columns as date classes
date_of_onset <- ymd("2020-03-16")
date_lab_confirmation <- ymd("2020-03-20")

# find the delay between onset and lab confirmation
days_to_lab_conf <- as.double(date_lab_confirmation - date_of_onset)
days_to_lab_conf
## [1] 4

In a dataframe (i.e. when working with a linelist), if either of the above dates is missing for a row, the subtraction returns NA instead of a numeric value for that row. When using this column for calculations (e.g. with median()), be sure to set the na.rm = argument to TRUE. For example:

# add a new column
# calculating the number of days between symptom onset and patient outcome
linelist_delay <- linelist_cleaned %>%
  mutate(
    days_onset_to_outcome = as.double(date_of_outcome - date_of_onset)
  )

# calculate the median number of days to outcome for all cases where data are available
med_days_outcome <- median(linelist_delay$days_onset_to_outcome, na.rm = T)

# often this operation might be done only on a subset of data cases, e.g. those who died
# this is easy to look at and will be explained later in the handbook

Converting dates/time zones

When data are present in different time zones, it can often be important to standardise them to a unified time zone. This can present a further challenge, as the time zone component of data must in most cases be coded manually.

In R, each datetime object has a time zone component. By default, all datetime objects will carry the local time zone of the computer being used - this is generally specific to a location rather than a named time zone, as time zones in a given location often change due to daylight saving time. It is not possible to accurately compensate for time zones without the time component of a date, as the event a date column represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot be reasonably accounted for.

To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different one. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones - if the location your data are from is not on this list, nearby large cities in the same time zone are available and serve the same purpose.

# assign the current time to an object
time_now <- Sys.time()
time_now
## [1] "2021-03-10 09:36:00 EST"
# use with_tz() to assign a new time zone to the object, while CHANGING the displayed clock time
time_london_real <- with_tz(time_now, "Europe/London")

# use force_tz() to assign a new time zone to the object, while KEEPING the displayed clock time
time_london_local <- force_tz(time_now, "Europe/London")


# note: as long as the computer used to run this code is NOT set to London time, there will be
# a difference between the times (equal to the number of hours between the computer's time zone and London)

time_london_real - time_london_local
## Time difference of 5 hours

This may seem largely abstract, and is often not needed if the user isn’t working across time zones. One simple example of its implementation is:

# TODO add when time column is here
# set the time column to time zone for ebola outbreak 

# "Africa/Lubumbashi" is the time zone for eastern DRC/Kivu Nord

Epidemiological weeks

Use the floor_date() function from lubridate, with unit = "week". See example below for specifying the week start day. The returned output is the start date of the week, in Date class.
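For instance, applied to a single date (assuming lubridate is installed; 10 March 2021 fell on a Wednesday):

```r
# floor a Wednesday to the start of its week, with weeks starting on Monday
lubridate::floor_date(as.Date("2021-03-10"), unit = "week", week_start = 1)
## [1] "2021-03-08"
```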

For example, you can create a new column of weeks within mutate(), then use group_by() with summarise() (or count()) to get weekly case counts.

To aggregate into weeks and show ALL weeks (even ones with no cases), do this:

  1. Create a new ‘week’ column within mutate(), using floor_date() from the lubridate package:
    • use unit = to set the desired time unit, e.g. "week"
    • use week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
  2. Follow with complete() to ensure that all weeks appear - even those with no cases.

For example:

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(
    week = lubridate::floor_date(date_onset,
                                 unit = "week")) %>%  # new column of week of onset
  count(week) %>%                                     # group data by week and count rows per group
  filter(!is.na(week)) %>%                            # remove entries for cases missing date_onset
  complete(week = seq.Date(from = min(week),          # fill-in all weeks with no cases reported
                           to = max(week),
                           by="week"))

Here are the first 20 rows of the resulting dataframe:

You can also use the package aweek to set epidemiological weeks. You can read more about it on the RECON website

Dates in Epicurves

See the section on Epidemic curves.

Lagging and leading calculations

lead() and lag() are functions from the dplyr package which help find previous (lagged) or subsequent (leading) values in a vector - typically a numeric or date vector. This is useful when doing calculations of change/difference between time units.

Let’s say you want to calculate the difference in cases between a current week and the previous one. The data are initially provided in weekly counts as shown below. To learn how to aggregate counts from daily to weekly see the page on aggregating (LINK).

When using lag() or lead(), the order of rows in the dataframe is very important - pay attention to whether your dates/numbers are sorted ascending or descending.

First, create a new column containing the value of the previous (lagged) week.

  • Control the number of units back/forward with n = (must be a non-negative integer)
  • Use default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
  • Use order_by = to provide a column by which to sort the values, if the rows are not already in order
counts <- counts %>% 
  mutate(cases_prev_wk = lag(cases_wk, n = 1))

Next, create a new column which is the difference between the two cases columns:

counts <- counts %>% 
  mutate(cases_prev_wk = lag(cases_wk, n = 1),
         case_diff = cases_wk - cases_prev_wk)

You can read more about lead() and lag() in the documentation here or by entering ?lag in your console.
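A self-contained sketch of these steps on a hypothetical toy dataset (column names are illustrative), also demonstrating lead():

```r
library(dplyr)

# hypothetical toy data: weekly case counts
counts_toy <- tibble(
  epiweek  = as.Date("2021-01-04") + 7 * (0:3),
  cases_wk = c(5, 12, 9, 16))

counts_toy <- counts_toy %>%
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),       # previous week's count (NA in first row)
    cases_next_wk = lead(cases_wk, n = 1),      # following week's count (NA in last row)
    case_diff     = cases_wk - cases_prev_wk)   # week-on-week change
```

In this toy data, case_diff is NA, 7, -3, 7 - the first row has no prior week to compare against.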

Dates miscellaneous

  • Sys.Date( ) from base R returns the current date of your computer
  • Sys.Time() from base R returns the current time of your computer
  • now() from lubridate returns the current date and time of your computer

Resources

lubridate tidyverse page
lubridate RStudio cheatsheet
R for Data Science page on dates and times
Online tutorial

Factors

In R, factors allow for ordered categorical data. A column can be converted from class character, numeric, or even logical to class factor. The values are then stored internally as integer levels with a set order, and can display with assigned labels.

In a column of class factor:

  • the possible values are restricted - values not already defined as levels are rejected
  • values are ordered, which impacts how they display in tables and plots

Most of this page will use functions from the package forcats (a short name for “For categorical variables”).

Factors are also useful in statistical modeling, allowing values such as 1/0 to be treated as categorical rather than continuous.

Preparation

Load packages

Below are the packages used in this page

pacman::p_load(
  rio,           # import/export
  here,          # filepaths
  lubridate,     # working with dates
  forcats,       # factors
  tidyverse      # data mgmt and viz
  )

Load data

In this page we demonstrate using the linelist loaded below with import() (see page on Import and Export).

# fake import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Create new column

We begin in a common epidemiological scenario - the creation of a categorical column. In this case, we use the existing column days_onset_hosp (days from symptom onset to hospital admission) to classify rows into categorical groups.

The dplyr function case_when() is used to apply logical criteria on each row, resulting in values for the new column delay

linelist <- linelist %>% 
  mutate(delay = case_when(
    days_onset_hosp < 2                        ~ "<2 days",
    days_onset_hosp >= 2 & days_onset_hosp < 5 ~ "2-5 days",
    days_onset_hosp >= 5                       ~ ">5 days",
    is.na(days_onset_hosp)                     ~ NA_character_,
    TRUE                                       ~ "Check me"))  

This is a column of character categorical values, but is not yet classified as a factor. Thus, in a frequency table, we see that the values appear in a default alphabetical order - an order that does not make much intuitive sense:

table(linelist$delay, useNA = "always")
## 
##  <2 days  >5 days 2-5 days     <NA> 
##     3183      602     2103        0

Likewise, if we make a bar plot the values also appear in this order from the bottom:

ggplot(data = linelist, aes(x = delay))+
  geom_bar()+
  theme_classic()

Convert to factor

To initially convert a column to class factor, use the base R function factor(). Below, the dataframe linelist is modified such that the column delay is converted to a factor.

linelist <- linelist %>%
  mutate(delay = factor(delay))

Unless specified, the levels will still be in alphabetic (or numeric) order. Use the base R function levels() to see how the levels of delay are ordered. Note that NA is not a factor level.

levels(linelist$delay)
## [1] "<2 days"  ">5 days"  "2-5 days"

Adjust level order

Using the package forcats, there are several functions to adjust the order of a factor’s levels:

  • Use fct_relevel() to manually adjust the order
  • Use fct_infreq() to reorder by frequency (highest to lowest)
  • Use fct_inorder() to reorder by order of appearance in the data
  • Use fct_reorder() to reorder by another column (e.g. order the delay levels by each group’s median CT value)
  • Use fct_rev() to reverse the existing order
  • Use fct_reorder2() to reorder by the final values when plotted with two other columns

These functions can be applied outside of a plot to re-define the column, or within a plot to affect just one specific plot.

Examples

fct_relevel()

This function is used to manually set the order of factor levels. You can write all the levels in the desired order, but it is not necessary to specify every level - you can adjust the position of only certain levels, using the after = argument.

Here are examples of redefining the column, with a new order of levels:

# re-define level order
linelist <- linelist %>% 
  mutate(delay = fct_relevel(delay, c("<2 days", "2-5 days", ">5 days")))


# the same, using base R assignment syntax
linelist$delay <- fct_relevel(linelist$delay, c("<2 days", "2-5 days", ">5 days"))

Alternatively, you can adjust the levels of a factor from within a plot command, in which case the re-ordering applies only to that plot. Below, the specified order displays from the bottom of the plot to the top.

ggplot(data = linelist, aes(x = fct_relevel(delay, c("<2 days", "2-5 days", ">5 days"))))+
  geom_bar()

Note how the default x-axis label is now quite complicated - you can overwrite it with labs() in ggplot2.

fct_infreq()

To order levels by the frequency with which they appear in the data, use fct_infreq(). Any missing values (NA) will automatically be included at the end.

You can reverse the order by wrapping with fct_rev(), like this: fct_rev(fct_infreq(delay)).

# ordered by frequency
ggplot(data = linelist, aes(x = fct_infreq(delay)))+
  geom_bar()+
  labs(x = "Delay onset to admission (days)")

# reversed frequency
ggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay))))+
  geom_bar()+
  labs(x = "Delay onset to admission (days)")

fct_reorder()

Use this function to order the levels by another column. For example, to order boxplots showing delay by the median CT value of each delay group.

In the examples below, the x-axis is the delay group, and the y-axis is the CT value. The boxplots are also colored by delay group.

In the first example, the baseline order of the levels applies (as set earlier in this page) - they increase incrementally by delay.
In the second example, the x-axis column has been wrapped in fct_reorder(), with the column ct_blood as the second argument. By default, the delay levels are ordered by each group’s median ct_blood value. An alternative summary function can be supplied, e.g. "mean" or "max".

Note there are no explicit grouping steps required prior to the ggplot() - the grouping and calculations are all done internally.

# boxplots ordered by original factor levels
ggplot(data = linelist)+
  geom_boxplot(
    aes(x = delay,
        y = ct_blood, 
        fill = delay))+
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by increasing delay (original factor levels)")+
  theme_classic()+
  theme(legend.position = "none")

# boxplots ordered by median CT value
ggplot(data = linelist)+
  geom_boxplot(
    aes(x = fct_reorder(delay, ct_blood, "median"),
        y = ct_blood,
        fill = delay))+
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by median CT value in group")+
  theme_classic()+
  theme(legend.position = "none")

fct_reorder2()

Use this function to order the legend colors by the vertical order of groups at the “end” of the plot. For example, if you have lines showing case counts by hospital over time, you can apply fct_reorder2() to the color = argument within aes(), such that the vertical order of hospitals appearing in the legend aligns with the order of lines at the terminal end of the plot. Read more in the function documentation.

linelist %>%         # begin with the linelist            
  count(             # summarise so n = counts of rows by epiweek and by hospital
    epiweek = lubridate::floor_date(date_onset, "week"),  # create and group by epiweeks
    hospital         # also group by hospital
    ) %>% 
  
  ggplot()+           # start plot
  geom_line(          # make lines
    aes(x = epiweek,  # x-axis epiweek
        y = n,        # height in number of rows
        color = fct_reorder2(hospital, epiweek, n)))+ # grouped by hospital and colors ordered by n value at end of plot
  labs(color = "Hospital")  # change legend title

fct_lump()

To “lump” together many low-frequency levels into an “Other” group, you can use this function. Do one of the following:

  • Set n = argument as the number of groups you want to keep. All other values will combine into “Other”.
  • Set the prop = argument as the minimum proportion of rows a level must appear in to be kept. All other values will combine into “Other”.

You can also change the label for “Other” by using other_level =. Below, all but the two most-frequent hospitals are combined into “Other hospitals”.

ggplot(data = linelist)+
  geom_bar(aes(x = fct_lump(hospital,    # column for x-axis
                            n = 2,       # keep two most-frequent levels
                            other_level = "Other hospitals"))) # label for "Other" group

You can also use fct_other() to manually assign factor levels to an “Other” group. Below, all hospital values aside from “Port Hospital” and “Central Hospital” are combined into “Other”.

You can use the arguments keep = or drop =, and can change the label of “Other” with other_level =.

linelist %>% 
  mutate(hospital = fct_other(hospital, keep = c("Port Hospital", "Central Hospital"))) %>% 
  select(hospital) %>% 
  table()
## .
## Central Hospital    Port Hospital            Other 
##              454             1762             3672

Missing values

If you have NA values in your column, you can easily convert them to a named value such as “Missing” with fct_explicit_na(), as performed below temporarily on the column delay:

linelist %>% 
  mutate(delay = fct_explicit_na(delay, na_level = "Missing")) %>% 
  select(delay) %>% 
  table(useNA = "always")
## .
##  <2 days 2-5 days  >5 days     <NA> 
##     3183     2103      602        0

Edit labels

Adjust the factor labels with fct_recode(). Remember that this changes only the displayed labels of the levels, not the underlying integer codes.
Below, the labels of the factor column delay (grouped days from onset to admission) are edited:

The old labels:

table(linelist$delay, useNA = "always")
## 
##  <2 days 2-5 days  >5 days     <NA> 
##     3183     2103      602        0

Now the labels are changed, using syntax fct_recode(column, "new" = "old","new" = "old", "new" = "old"). Remember that NA is not a formal level unless changed (e.g. with fct_explicit_na() as shown above).

linelist <- linelist %>% 
  mutate(delay = fct_recode(delay,
                            "Less than 2 days" = "<2 days",
                            "2 to 5 days"      = "2-5 days",
                            "More than 5 days" = ">5 days"))

table(linelist$delay)
## 
## Less than 2 days      2 to 5 days More than 5 days 
##             3183             2103              602

Add/drop levels

If you have a factor and want to add levels (regardless of whether there are any rows with those values), use fct_expand().

First, we classify hospital as a factor and review its existing levels:

linelist <- linelist %>% 
  mutate(hospital = factor(hospital))

levels(linelist$hospital)
## [1] "Central Hospital"                     "Military Hospital"                    "Missing"                             
## [4] "Other"                                "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"

Now we can add the level “University Hospital”:

linelist <- linelist %>% 
  mutate(hospital = fct_expand(hospital, "University Hospital"))

levels(linelist$hospital)
## [1] "Central Hospital"                     "Military Hospital"                    "Missing"                             
## [4] "Other"                                "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"
## [7] "University Hospital"
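For the other half of this section’s title - dropping levels - forcats provides fct_drop(), which removes levels that are unused by any row. A minimal sketch on a toy factor:

```r
library(forcats)

# toy factor with an unused level "c"
f <- factor(c("a", "b"), levels = c("a", "b", "c"))

levels(f)            # "a" "b" "c"
levels(fct_drop(f))  # "a" "b" - the unused level is removed
```

This is useful after filtering a dataset, when empty levels would otherwise still appear in tables and plot legends.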

Epiweeks

TBD - under construction

Epiweeks produced by aweek will appear as an ordered factor

a <- linelist %>% 
  mutate(epiweek = aweek::date2week(date_onset))


a <- linelist %>% 
  mutate(epiweek = factor(lubridate::isoweek(date_onset)))

levels(a$epiweek)

Resources

R for Data Science page on factors.

Pivoting data

When manipulating data, pivoting can be understood to refer to one of two processes:

  1. the creation of pivot tables, which are tables “… of statistics that summarize the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way… They arrange and rearrange (or ‘pivot’) statistics in order to draw attention to useful information. This leads to finding figures and facts quickly making them integral to data analysis.” see wiki.
  2. The conversion of a table from long to wide format, or vice versa.

In this page, we will focus on the latter definition. The former is a crucial step in data analysis, and is covered elsewhere in the Grouping data and [Descriptive statistics] pages.

Wide-to-long

Transforming a dataset from wide to long (image source)

Data

Data are often entered and stored in a format that is useful for presentation, but not for analysis. Let us take the count_data dataset as an example. It is stored in a “wide” format, meaning that each column is a variable and each row is an observation. This is useful for presenting information in a table or for entering data (e.g. in Excel) from case report forms. However, the data typically need to be transformed to “long” format for analysis and visualisation.

count_data <- import("facility_count_data.rds")

Each observation in this dataset refers to the malaria counts at one of 65 facilities on a given date, ranging from 2019-03-18 to 2019-06-14. These facilities are located in one Province (North) and four Districts (Spring, Bolo, Dingo, and Barnard). The dataset provides the overall counts of malaria, as well as age-specific counts in each of three age groups - <4 years, 5-14 years, and 15 years and older.

Visualising the overall malaria counts over time poses no difficulty with the data in its current format:

ggplot(count_data) +
  geom_col(aes(x = data_date, y = malaria_tot))

However, what if we wanted to display the relative contributions of each age group to this total count? In this case, we need to ensure that the variable of interest (age group), appears in the dataset in a single column that can be passed to {ggplot2}’s “aesthetics” (aes()) function.


Consider also the common problem whereby data are stored with dates as the columns, as in the example dataset tidyr::table4a:

tidyr::table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
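A sketch of tidying table4a with tidyr’s pivot_longer() (detailed in the next section), gathering the two year columns into rows:

```r
library(tidyr)

# gather the year columns of table4a into long format
table4a_long <- pivot_longer(
  tidyr::table4a,
  cols      = c(`1999`, `2000`),  # the columns that are really values of a variable
  names_to  = "year",             # former column names become a 'year' column
  values_to = "cases")            # cell contents become a 'cases' column
```

The result has one row per country-year combination (6 rows), with columns country, year, and cases.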

pivot_longer()

First, let’s begin by loading our packages and converting count_data to a tibble for easy printing:

pacman::p_load(tidyverse)

# Convert count_data to `tibble` for better printing
count_data <- 
  count_data %>% 
  as_tibble() 

count_data
## # A tibble: 3,038 x 10
##    location_name data_date  submitted_date Province District `malaria_rdt_0-4` `malaria_rdt_5-14` malaria_rdt_15 malaria_tot newid
##    <chr>         <date>     <date>         <chr>    <chr>                <int>              <int>          <int>       <int> <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring                  11                 12             23          46     1
##  2 Facility 2    2019-06-13 2019-06-14     North    Bolo                    11                 10              5          26     2
##  3 Facility 3    2019-06-13 2019-06-14     North    Dingo                    8                  5              5          18     3
##  4 Facility 4    2019-06-13 2019-06-14     North    Bolo                    16                 16             17          49     4
##  5 Facility 5    2019-06-13 2019-06-14     North    Bolo                     9                  2              6          17     5
##  6 Facility 6    2019-06-13 2019-06-14     North    Dingo                    3                  1              4           8     6
##  7 Facility 6    2019-06-12 2019-06-14     North    Dingo                    4                  0              3           7     6
##  8 Facility 5    2019-06-12 2019-06-14     North    Bolo                    15                 14             13          42     5
##  9 Facility 5    2019-06-11 2019-06-14     North    Bolo                    11                 11             13          35     5
## 10 Facility 5    2019-06-10 2019-06-14     North    Bolo                    19                 15             15          49     5
## # ... with 3,028 more rows

Next, we want to use {tidyr}’s pivot_longer() function to convert the wide dataset to a long format, converting the four columns with data on malaria counts to two new columns: one which captures the variable name and one which captures the values from the cells. Since these four variables all begin with the prefix malaria_, we can make use of the handy function starts_with().

df_long <- 
  count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_")
  )

df_long
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name             value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>            <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_0-4     11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_5-14    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_15      23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_tot         46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_0-4     11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_5-14    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_15       5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_tot         26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_0-4      8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_5-14     5
## # ... with 12,142 more rows

However, we could also have specified the columns by position:

count_data %>% 
  pivot_longer(
    cols = 6:9
  )
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name             value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>            <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_0-4     11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_5-14    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_15      23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_tot         46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_0-4     11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_5-14    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_15       5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_tot         26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_0-4      8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_5-14     5
## # ... with 12,142 more rows

or by named range:

count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_tot
  )
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name             value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>            <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_0-4     11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_5-14    12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_15      23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_tot         46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_0-4     11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_5-14    10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_15       5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_tot         26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_0-4      8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_5-14     5
## # ... with 12,142 more rows

Notice that the newly created dataframe (df_long) has more rows (12,152 vs 3,038); it has become longer. In fact, it is precisely four times as long, because each row in the original dataset now represents four rows in df_long, one for each of the malaria count observations (<4y, 5-14y, 15y+, and total).

In addition to becoming longer, the new dataset has fewer columns (8 vs 10), as the data previously stored in four columns (those beginning with the prefix malaria_) are now stored in two. These two columns are given the default names name and value, but we can override these defaults with more meaningful names using the names_to and values_to arguments - which helps us remember what is stored within. Let’s use the names age_group and counts:

df_long <- 
  count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_"),
    names_to = "age_group",
    values_to = "counts"
  )

df_long
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid age_group        counts
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>             <int>
##  1 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_0-4      11
##  2 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_5-14     12
##  3 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_rdt_15       23
##  4 Facility 1    2019-06-13 2019-06-14     North    Spring       1 malaria_tot          46
##  5 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_0-4      11
##  6 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_5-14     10
##  7 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_rdt_15        5
##  8 Facility 2    2019-06-13 2019-06-14     North    Bolo         2 malaria_tot          26
##  9 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_0-4       8
## 10 Facility 3    2019-06-13 2019-06-14     North    Dingo        3 malaria_rdt_5-14      5
## # ... with 12,142 more rows

We can now pass this new dataset to {ggplot2} to display the malaria counts by age group:

ggplot(df_long) +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )

Have a look at the plot - what is wrong here? We have encountered a common problem - we have also included the total counts from the malaria_tot column, so the magnitude of each bar in the plot is twice as high as it should be.

We can handle this in a number of ways. We could simply filter it from the dataset we pass to {ggplot2}:

df_long %>% 
  filter(age_group != "malaria_tot") %>% 
  ggplot() +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )

Alternatively, we could have excluded this variable when running pivot_longer(), keeping it in the dataset as a separate column:

count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_rdt_15,
    names_to = "age_group",
    values_to = "counts"
  ) %>% 
  ggplot() +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group)
  )
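As an aside, the shared malaria_ prefix could be stripped from the new names during the pivot, via the names_prefix = argument of pivot_longer(). A minimal sketch on a toy tibble (hypothetical data):

```r
library(tidyr)

# toy tibble with prefixed column names (illustrative)
df <- tibble::tibble(id = 1, malaria_a = 2, malaria_b = 3)

df_prefix_long <- pivot_longer(df,
  cols         = starts_with("malaria_"),
  names_to     = "age_group",
  values_to    = "counts",
  names_prefix = "malaria_")   # strip this prefix from the new names
```

The age_group column now contains "a" and "b" rather than "malaria_a" and "malaria_b".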

Long-to-wide

Transforming a dataset from long to wide (image source)

In some instances, we may wish to convert a dataset to a wider format. For this, we can use the pivot_wider() function.

A typical use case is when we want to transform the results of an analysis into a format which is more digestible for the reader. Typically, we are transforming a dataset in which the observations are spread over multiple rows to one in which each observation occupies a single row.

This introduces the useful topic of “tidy data”, in which each variable has its own column, each observation has its own row, and each value has its own cell. More about this topic can be found here: https://r4ds.had.co.nz/tidy-data.html.

Data

Let us use the linelist dataset. Suppose that we want to know the counts of individuals in the different age groups, by sex:

linelist <- 
  linelist %>% 
  as_tibble()
  
df_wide <- 
  linelist %>% 
  count(age_cat, gender)

This gives us a long dataset that is great for visualisation, but not ideal for presentation in a table:

ggplot(df_wide) +
  geom_col(aes(x = age_cat, y = n, fill = gender))

Pivot wider

Therefore, we can use pivot_wider() to put this into a better format for inclusion as tables in our reports. The argument names_from specifies the column from which to generate the new column names, while the argument values_from specifies the column from which to take the values to populate the cells:

table_wide <- 
  df_wide %>% 
  pivot_wider(
    names_from = gender,
    values_from = n
  )

table_wide
## # A tibble: 9 x 4
##   age_cat     f     m  `NA`
##   <fct>   <int> <int> <int>
## 1 0-4       624   404    38
## 2 5-9       651   414    38
## 3 10-14     555   334    29
## 4 15-19     381   367    25
## 5 20-29     440   626    36
## 6 30-49     161   539    24
## 7 50-69       3    93     6
## 8 70+        NA    12     1
## 9 <NA>       NA    NA    87

This table is much nicer for inclusion in our reports:

table_wide %>% 
  janitor::adorn_totals(c("row", "col")) %>% # adds a total row and column
  knitr::kable() %>% 
  kableExtra::row_spec(row = 9, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
age_cat      f      m     NA   Total
0-4        624    404     38    1066
5-9        651    414     38    1103
10-14      555    334     29     918
15-19      381    367     25     773
20-29      440    626     36    1102
30-49      161    539     24     724
50-69        3     93      6     102
70+         NA     12      1      13
NA          NA     NA     87      87
Total     2815   2789    284    5888
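A related tip: when a combination of names_from/id columns has no corresponding row, pivot_wider() leaves NA in that cell by default (as with the 70+ females above). The values_fill = argument substitutes a value instead. A minimal sketch on toy data:

```r
library(tidyr)

# toy counts with one missing combination (illustrative): no "m" row for group "y"
df <- tibble::tibble(
  g   = c("x", "x", "y"),
  sex = c("f", "m", "f"),
  n   = c(1, 2, 3))

wide <- pivot_wider(df,
  names_from  = sex,
  values_from = n,
  values_fill = 0)   # absent combinations become 0 instead of NA
```

Whether 0 is appropriate depends on the data - a true count of zero is different from a value that was never recorded.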

Fill

Filling in missing data

Data

In some situations after a pivot, and more commonly after a bind, we are left with gaps in some cells that we would like to fill. For example, take two datasets, each with observations for the measurement number, the name of the facility, and the case count at that time. However, the second dataset also has a variable Year. When we perform a bind_rows() to join the two datasets together, the Year variable is filled with NA for those rows where there was no prior information (i.e. the first dataset):

df1 <- 
  tibble::tribble(
       ~Measurement, ~Facility, ~Cases,
                  1,  "Hosp 1",     66,
                  2,  "Hosp 1",     26,
                  3,  "Hosp 1",      8,
                  1,  "Hosp 2",     71,
                  2,  "Hosp 2",     62,
                  3,  "Hosp 2",     70,
                  1,  "Hosp 3",     47,
                  2,  "Hosp 3",     70,
                  3,  "Hosp 3",     38,
       )

df1 
## # A tibble: 9 x 3
##   Measurement Facility Cases
##         <dbl> <chr>    <dbl>
## 1           1 Hosp 1      66
## 2           2 Hosp 1      26
## 3           3 Hosp 1       8
## 4           1 Hosp 2      71
## 5           2 Hosp 2      62
## 6           3 Hosp 2      70
## 7           1 Hosp 3      47
## 8           2 Hosp 3      70
## 9           3 Hosp 3      38
df2 <- 
  tibble::tribble(
    ~Year, ~Measurement, ~Facility, ~Cases,
     2000,            1,  "Hosp 4",     82,
     2001,            2,  "Hosp 4",     87,
     2002,            3,  "Hosp 4",     46
  )

df2
## # A tibble: 3 x 4
##    Year Measurement Facility Cases
##   <dbl>       <dbl> <chr>    <dbl>
## 1  2000           1 Hosp 4      82
## 2  2001           2 Hosp 4      87
## 3  2002           3 Hosp 4      46
df_combined <- 
  bind_rows(df1, df2) %>% 
  arrange(Measurement, Facility)

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66    NA
##  2           1 Hosp 2      71    NA
##  3           1 Hosp 3      47    NA
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26    NA
##  6           2 Hosp 2      62    NA
##  7           2 Hosp 3      70    NA
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8    NA
## 10           3 Hosp 2      70    NA
## 11           3 Hosp 3      38    NA
## 12           3 Hosp 4      46  2002

fill()

In this case, Year is a useful variable to include, particularly if we want to explore trends over time. Therefore, we use fill() to fill in those empty cells, by specifying the column to fill and the direction (in this case up):

df_combined %>% 
  fill(Year, .direction = "up")
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66  2000
##  2           1 Hosp 2      71  2000
##  3           1 Hosp 3      47  2000
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26  2001
##  6           2 Hosp 2      62  2001
##  7           2 Hosp 3      70  2001
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8  2002
## 10           3 Hosp 2      70  2002
## 11           3 Hosp 3      38  2002
## 12           3 Hosp 4      46  2002

We can rearrange the data so that we would need to fill in a downward direction:

df_combined <- 
  df_combined %>% 
  arrange(Measurement, desc(Facility))

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47    NA
##  3           1 Hosp 2      71    NA
##  4           1 Hosp 1      66    NA
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70    NA
##  7           2 Hosp 2      62    NA
##  8           2 Hosp 1      26    NA
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38    NA
## 11           3 Hosp 2      70    NA
## 12           3 Hosp 1       8    NA
df_combined <- 
  df_combined %>% 
  fill(Year, .direction = "down")

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47  2000
##  3           1 Hosp 2      71  2000
##  4           1 Hosp 1      66  2000
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70  2001
##  7           2 Hosp 2      62  2001
##  8           2 Hosp 1      26  2001
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38  2002
## 11           3 Hosp 2      70  2002
## 12           3 Hosp 1       8  2002

This dataset is now useful for plotting:

ggplot(df_combined) +
  aes(Year, Cases, fill = Facility) +
  geom_col()

But less useful for presenting in a table, so let’s practice converting this long, untidy dataframe into a wider, tidy dataframe:

df_combined %>% 
  pivot_wider(
    id_cols = c(Facility, Year, Cases),
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  arrange(Facility) %>% 
  janitor::adorn_totals(c("row", "col")) %>% 
  knitr::kable() %>% 
  kableExtra::row_spec(row = 5, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
Facility 2000 2001 2002 Total
Hosp 1 66 26 8 100
Hosp 2 71 62 70 203
Hosp 3 47 70 38 155
Hosp 4 82 87 46 215
Total 266 245 162 673

N.B. In this case, we had to specify (via id_cols =) that only the three variables Facility, Year, and Cases be included, because the additional variable Measurement would otherwise interfere with the creation of the table:

df_combined %>% 
  pivot_wider(
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  knitr::kable()
Measurement Facility 2000 2001 2002
1 Hosp 4 82 NA NA
1 Hosp 3 47 NA NA
1 Hosp 2 71 NA NA
1 Hosp 1 66 NA NA
2 Hosp 4 NA 87 NA
2 Hosp 3 NA 70 NA
2 Hosp 2 NA 62 NA
2 Hosp 1 NA 26 NA
3 Hosp 4 NA NA 46
3 Hosp 3 NA NA 38
3 Hosp 2 NA NA 70
3 Hosp 1 NA NA 8

Resources

Here is a helpful tutorial

Grouping data

This page reviews how to group and aggregate data for descriptive analysis. It makes use of tidyverse packages for common and easy-to-use functions.

Overview

Grouping data is a core component of data management and analysis. Grouped data can be plotted, or statistically summarised by group. Functions from the dplyr package (part of the tidyverse) make grouping and subsequent operations quite easy.

This page will address the following topics:

  • Grouping data with the group_by() function
  • Un-group data
  • summarise() grouped data with statistics
  • The difference between count() and tally()
  • arrange() applied to grouped data
  • filter() applied to grouped data
  • mutate() applied to grouped data
  • select() applied to grouped data
  • The base R aggregate() command as an alternative

Preparation

Load packages

Ensure tidyverse package is installed and loaded (includes dplyr).

pacman::p_load(
  rio,       # to import data
  here,      # to locate files
  tidyverse, # to clean, handle, and plot the data (includes dplyr)
  janitor)   # adding total rows and columns

Load data

For this page we use the cleaned linelist dataset

linelist <- rio::import(here("data", "linelist_cleaned.xlsx"))

The first 50 rows of linelist:

Grouping

The function group_by() from dplyr groups the rows by the unique values in the specified columns. Each unique value constitutes a group (or each unique combination of values, if multiple grouping columns are specified). Subsequent changes to the dataset or calculations can then be performed within the context of each unique group.

For example, the command below takes the linelist and groups the rows by unique values in column outcome, saving the output as a new dataframe ll_by_outcome. The grouping column name is placed inside the parentheses of the function group_by().

ll_by_outcome <- linelist %>% 
  group_by(outcome)

Note that there is no perceptible change to the dataset after group_by(), until another dplyr verb such as mutate() or summarise() is applied on the “grouped” dataframe.

You can however “see” the groupings by printing the dataframe. When you print a grouped dataframe, you will see it has been transformed into a tibble class object (LINK) which, when printed, displays which grouping columns have been applied and how many groups there are - written just above the header row.

# print to see which groups are active
ll_by_outcome
## # A tibble: 5,888 x 30
## # Groups:   outcome [3]
##    case_id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5
##    <chr>        <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>   
##  1 a3c8b8           4 2014-05-07     2014-05-08 2014-05-10       2014-05-14   Recover m          1 years            1 0-4     0-4     
##  2 d8a13d           4 2014-05-06     2014-05-08 2014-05-10       NA           <NA>    f          4 years            4 0-4     0-4     
##  3 5fe599           4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m         21 years           21 20-29   20-24   
##  4 8689b7           4 NA             2014-05-13 2014-05-14       2014-05-18   Recover f          2 years            2 0-4     0-4     
##  5 11f8ea           2 NA             2014-05-16 2014-05-18       2014-05-30   Recover m         27 years           27 20-29   25-29   
##  6 893f25           3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m         25 years           25 20-29   25-29   
##  7 be99c8           3 2014-05-03     2014-05-22 2014-05-23       2014-05-24   Recover f         18 years           18 15-19   15-19   
##  8 d0523a           7 2014-05-20     2014-05-24 2014-05-26       2014-06-05   <NA>    f          2 years            2 0-4     0-4     
##  9 ce9c02           5 2014-05-27     2014-05-27 2014-05-29       2014-06-17   Death   m         20 years           20 20-29   20-24   
## 10 275cc7           5 2014-05-24     2014-05-27 2014-05-28       2014-06-07   Death   f          4 years            4 0-4     0-4     
## # ... with 5,878 more rows, and 17 more variables: hospital <chr>, lon <dbl>, lat <dbl>, infector <chr>, source <chr>, wt_kg <dbl>,
## #   ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>,
## #   bmi <dbl>, days_onset_hosp <dbl>

Unique groups

The groups created reflect each unique combination of values in the grouping columns. To see the groups and the number of rows in each group, pass the grouped data to tally(). To see just the unique groups without counts you can pass to group_keys().

See below that there are three unique values in the grouping column outcome: “Death”, “Recover”, and NA. See that there were 2582 deaths, 1983 recoveries, and 1323 with no outcome recorded.

linelist %>% 
  group_by(outcome) %>% 
  tally()
## # A tibble: 3 x 2
##   outcome     n
## * <chr>   <int>
## 1 Death    2582
## 2 Recover  1983
## 3 <NA>     1323
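The group_keys() function mentioned above can be sketched on a small toy dataframe (the column name outcome mirrors the example, but the data here are illustrative):

```r
library(dplyr)

# toy data with a grouping column, including a missing value
df <- tibble::tibble(
  outcome = c("Death", "Recover", NA, "Death"),
  age     = c(20, 35, 41, 18)
)

keys <- df %>% 
  group_by(outcome) %>% 
  group_keys()        # one row per unique group, no counts

keys
## # A tibble: 3 x 1
##   outcome
##   <chr>  
## 1 Death  
## 2 Recover
## 3 <NA>
```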

You can group by more than one column. Below, the dataframe is grouped by outcome and gender, and then tallied. Note how each unique combination of outcome and gender is registered as its own group - including missing values for either column.

linelist %>% 
  group_by(outcome, gender) %>% 
  tally()
## # A tibble: 9 x 3
## # Groups:   outcome [3]
##   outcome gender     n
##   <chr>   <chr>  <int>
## 1 Death   f       1246
## 2 Death   m       1231
## 3 Death   <NA>     105
## 4 Recover f        946
## 5 Recover m        933
## 6 Recover <NA>     104
## 7 <NA>    f        623
## 8 <NA>    m        625
## 9 <NA>    <NA>      75

New columns

You can also create a new grouping column within the group_by() statement. This is equivalent to calling mutate() before the group_by(). For a quick tabulation this style can be handy, but for more clarity in your code consider creating this column in its own mutate() step and then piping to group_by().

# group data based on a binary column created *within* the group_by() command
linelist %>% 
  group_by(
    age_class = ifelse(age >= 18, "adult", "child")) %>% 
  tally(sort = T)
## # A tibble: 3 x 2
##   age_class     n
##   <chr>     <int>
## 1 child      3546
## 2 adult      2255
## 3 <NA>         87

Replace/add grouping columns

By default if you run group_by() on data that are already grouped, the old groups will be removed and the new one(s) will apply. If you want to add new groups to the existing ones, include the argument .add=TRUE.

# Grouped by outcome
by_outcome <- linelist %>% 
  group_by(outcome)

# Add grouping by gender in addition
by_outcome_gender <- by_outcome %>% 
  group_by(gender, .add = TRUE)
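To confirm which grouping columns are currently active, you can use group_vars() from dplyr. A quick sketch on a toy dataframe (the column names are illustrative):

```r
library(dplyr)

df <- tibble::tibble(
  outcome = c("Death", "Recover"),
  gender  = c("f", "m")
)

by_outcome <- df %>% group_by(outcome)
group_vars(by_outcome)
## [1] "outcome"

# .add = TRUE keeps the existing grouping and adds gender
by_both <- by_outcome %>% group_by(gender, .add = TRUE)
group_vars(by_both)
## [1] "outcome" "gender"
```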

Un-group

Data that have been grouped will remain grouped until specifically ungrouped via ungroup(). If you forget to ungroup, it can lead to incorrect calculations! Below is an example of removing all grouping columns:

linelist %>% 
  group_by(outcome, gender) %>% 
  tally() %>% 
  ungroup()

You can also remove the grouping for only specific columns, by placing the column name inside ungroup().

linelist %>% 
  group_by(outcome, gender) %>% 
  tally() %>% 
  ungroup(gender) # remove the grouping by gender, leave grouping by outcome

NOTE: The verb count() automatically ungroups the data after counting.

Summarise

By applying the dplyr verb summarise() to grouped data, you can produce summary tables containing descriptive statistics for each group.

Within the summarise() statement, provide the name(s) of the new summary column(s), an equals sign, and then a statistical function to apply to the data, as shown below. Within a statistical function, list the column to be operated on and any relevant arguments. For example, do not forget the argument na.rm = TRUE to remove missing values from calculations!

Below is an example of summarise() applied without grouped data. The statistics returned are produced from the entire dataset.

# summary statistics on ungrouped linelist
linelist %>% 
  summarise(
    mean_age = mean(age_years, na.rm=T),
    max_age  = max(age_years, na.rm=T),
    min_age  = min(age_years, na.rm=T))
##   mean_age max_age min_age
## 1 16.14426      90       0

In contrast, below is the same summarise() statement applied to grouped data. The statistics are calculated for each outcome group.

# summary statistics on grouped linelist
linelist %>% 
  group_by(outcome) %>% 
  summarise(
    mean_age = mean(age_years, na.rm=T),
    max_age  = max(age_years, na.rm=T),
    min_age  = min(age_years, na.rm=T))
## # A tibble: 3 x 4
##   outcome mean_age max_age min_age
## * <chr>      <dbl>   <dbl>   <dbl>
## 1 Death       16.0      90       0
## 2 Recover     16.4      75       0
## 3 <NA>        16.0      87       0

TIP: The summarise function works with both UK and US spelling - summarise() and summarize() call the same function.

Summarise across() multiple columns

You can apply summarise() across multiple columns using across(). Specify which columns to operate across by either:

  • Providing a vector of column names, or
  • Using the select() helper functions (explained below) to select columns by criteria

Below, mean() is applied to ungrouped data (global calculation). The columns are specified, a function is specified (no parentheses), and finally, any additional arguments for the function (e.g. na.rm=TRUE).

linelist %>% 
  summarise(across(.cols = c(age_years, temp),
                   .fns = mean,
                   na.rm=T))
##   age_years     temp
## 1  16.14426 38.54144

Below, the same summarise across() call is applied on grouped data:

linelist %>% 
  group_by(outcome) %>% 
  summarise(across(.cols = c(age_years, temp),
                   .fns = mean,
                   na.rm=T))
## # A tibble: 3 x 3
##   outcome age_years  temp
## * <chr>       <dbl> <dbl>
## 1 Death        16.0  38.5
## 2 Recover      16.4  38.6
## 3 <NA>         16.0  38.5

Here are the select() helper functions that you can place within across() to specify columns:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: starts_with("date")
  • ends_with() - matches to a specified suffix. Example: ends_with("_end")
  • contains() - columns containing a character string. Example: contains("time")
  • matches() - to apply a regular expression (regex). Example: matches("[pt]al")
  • num_range() - matches columns with a numerical range, e.g. columns x1, x2, and x3. Example: num_range("x", 1:3)
  • any_of() - matches columns by name, without error if a name does not exist. Useful if a column might not be present. Example: any_of(c("date_onset", "date_death", "cardiac_arrest"))

For example, to return the mean of every numeric column:

linelist %>% 
  group_by(outcome) %>% 
  summarise(across(where(is.numeric), .fns = mean, na.rm=T))
## # A tibble: 3 x 12
##   outcome generation   age age_years   lon   lat wt_kg ht_cm ct_blood  temp   bmi days_onset_hosp
## * <chr>        <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>           <dbl>
## 1 Death         16.7  16.1      16.0 -13.2  8.47  52.8  124.     21.3  38.5  48.8            1.79
## 2 Recover       16.4  16.5      16.4 -13.2  8.47  53.5  126.     21.1  38.6  47.7            2.28
## 3 <NA>          16.5  16.0      16.0 -13.2  8.47  53.3  125.     21.2  38.5  47.4            2.04

If you want summary statistics of multiple columns in an easy-to-read format, consider a two-way table made with the gtsummary package. This package is demonstrated more extensively in the [Descriptive statistics] page.

# load package
pacman::p_load(gtsummary) 

linelist %>% 
  select(outcome, age_years, temp, ht_cm) %>%       # select columns (optional)
  gtsummary::tbl_summary( 
    by = outcome,                                   # indicate grouping column (optional)
    statistic = all_continuous() ~ "{mean} ({sd})") # return mean and std deviation for each group
Characteristic Death, N = 2,5821 Recover, N = 1,9831
age_years 16 (13) 16 (13)
Unknown 24 38
temp 38.52 (1.00) 38.57 (0.99)
Unknown 55 52
ht_cm 124 (50) 126 (50)

1 Mean (SD)

Count and tally

count() and tally() provide similar functionality but behave differently.

tally() is shorthand for summarise(), and does not automatically group data. Thus, to achieve grouped tallies it must follow a group_by() command. You can add sort = TRUE to see the largest groups first.

linelist %>% 
  tally
##      n
## 1 5888
linelist %>% 
  group_by(outcome) %>% 
  tally(sort = TRUE)
## # A tibble: 3 x 2
##   outcome     n
##   <chr>   <int>
## 1 Death    2582
## 2 Recover  1983
## 3 <NA>     1323

In contrast, count() does the following:

  • applies group_by() on the specified column(s)
  • applies summarise() and returns column n with the number of observations per group
  • applies ungroup()
linelist %>% 
  count(outcome)
##   outcome    n
## 1   Death 2582
## 2 Recover 1983
## 3    <NA> 1323

Just like with group_by() you can create a new column within the count() command:

linelist %>% 
  count(age_class = ifelse(age >= 18, "adult", "child"), sort = T)
##   age_class    n
## 1     child 3546
## 2     adult 2255
## 3      <NA>   87

Read more about the distinction between tally() and count() here

Both of these verbs can be called multiple times, with the functionality “rolling up”. For example, to summarise the number of genders present for each outcome, run the following. Note, the name of the final column is changed from default “n” for clarity.

linelist %>% 
  # produce counts by outcome-gender groups
  count(outcome, gender) %>% 
  # produce counts of gender within each outcome group
  count(outcome, name = "number of genders per outcome" ) 
##   outcome number of genders per outcome
## 1   Death                             3
## 2 Recover                             3
## 3    <NA>                             3

Add totals

If you want to add total rows or columns after using tally() or count(), consider using the janitor package, which offers functions like adorn_totals() and adorn_percentages(). There are many useful functions (search the package's Help for details); here are a few of them, with examples below:

  • Use adorn_totals() to get totals - specify the argument where = either “row” or “col” or c("row", "col").
  • Use adorn_percentages() to convert counts to proportions - specify the argument denominator = either “row”, “col”, or “all”.
  • Use adorn_pct_formatting() to convert proportions to percentages (can specify number of digits =, whether to add “%” with affix_sign =, and specify specific column names to operate on)
  • Use adorn_ns() to add back the underlying counts (“N”s) to a table whose proportions were calculated by adorn_percentages() - to display them together. Indicate position = of the Ns as either “rear” or “front” of the proportions.

To add totals:

linelist %>% 
  count(outcome) %>% 
  adorn_totals(where = "col")
##  outcome    n Total
##    Death 2582  2582
##  Recover 1983  1983
##     <NA> 1323  1323

To convert the numbers to proportions:

linelist %>% 
  count(outcome) %>% 
  adorn_totals(where = "row") %>%              # add total row
  adorn_percentages(denominator = "col") %>%   # convert to column proportions
  adorn_rounding(digits = 2)                   # round the proportions
##  outcome    n
##    Death 0.44
##  Recover 0.34
##     <NA> 0.22
##    Total 1.00

To display both counts and percents:

linelist %>% 
  count(outcome) %>%              # produce the counts by unique outcome
  adorn_totals(where = "row") %>% # add total row
  adorn_percentages("col") %>%    # add proportion by column
  adorn_pct_formatting() %>%      # proportion converted to percent
  adorn_ns(position = "front")    # Add the underlying N, in front of the percentage
##  outcome             n
##    Death 2582  (43.9%)
##  Recover 1983  (33.7%)
##     <NA> 1323  (22.5%)
##    Total 5888 (100.0%)

Arranging grouped data

The dplyr verb arrange(), which orders the rows in a dataframe, behaves the same whether or not the data are grouped, unless you set the argument .by_group = TRUE. In that case the rows are ordered first by the grouping columns and then by any other columns you specify to arrange().
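As a sketch on a toy dataframe (column names illustrative): without .by_group the grouping is ignored by arrange(), while with .by_group = TRUE the rows are sorted within their groups.

```r
library(dplyr)

df <- tibble::tibble(
  hospital = c("B", "A", "B", "A"),
  cases    = c(5, 9, 2, 7)
)

grouped <- df %>% group_by(hospital)

# grouping ignored: rows ordered by cases only (2, 5, 7, 9)
ignored <- grouped %>% arrange(cases)

# grouping respected: ordered by hospital first, then cases (A 7, A 9, B 2, B 5)
respected <- grouped %>% arrange(cases, .by_group = TRUE)
```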

Filter on grouped data

filter()

When filter() is applied in conjunction with functions that evaluate the dataframe (like max(), min(), or mean()), those functions are now applied within each group. For example, if you filter to keep rows where patients are above the median age, this now applies per group - keeping rows above each group's median age.
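A minimal sketch of this behaviour on a toy dataframe (names illustrative) - each row is kept only if its age exceeds the median age of its own group:

```r
library(dplyr)

df <- tibble::tibble(
  outcome = c("Death", "Death", "Death", "Recover", "Recover", "Recover"),
  age     = c(10, 20, 30, 40, 50, 60)
)

above_group_median <- df %>% 
  group_by(outcome) %>% 
  filter(age > median(age))   # median() is evaluated per group

above_group_median
## keeps age 30 (Death group, median 20) and age 60 (Recover group, median 50)
```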

Slice rows per group

The dplyr function slice(), which subsets rows based on their position in the data, can also be applied per group. Remember to account for sorting the data within each group to get the desired “slice”.

For example, to retrieve the 5 earliest admissions to each hospital:

  1. Group the linelist by column hospital
  2. Arrange the records from earliest to latest date_hospitalisation within each hospital group
  3. Slice to retrieve the first 5 rows from each hospital
linelist %>%
  group_by(hospital) %>%
  arrange(hospital, date_hospitalisation) %>%
  slice_head(n = 5) %>% 
  arrange(hospital) %>% 
  select(case_id, hospital, date_hospitalisation)
## # A tibble: 30 x 3
## # Groups:   hospital [6]
##    case_id hospital          date_hospitalisation
##    <chr>   <chr>             <date>              
##  1 20b688  Central Hospital  2014-05-06          
##  2 d58402  Central Hospital  2014-05-10          
##  3 b8f2fd  Central Hospital  2014-05-13          
##  4 275cc7  Central Hospital  2014-05-28          
##  5 acf422  Central Hospital  2014-05-28          
##  6 d1fafd  Military Hospital 2014-04-17          
##  7 6a9004  Military Hospital 2014-05-13          
##  8 974bc1  Military Hospital 2014-05-13          
##  9 09e386  Military Hospital 2014-05-14          
## 10 865581  Military Hospital 2014-05-15          
## # ... with 20 more rows

slice_head() - selects n rows from the top
slice_tail() - selects n rows from the end
slice_sample() - randomly selects n rows
slice_min() - selects the n rows with the lowest values in the order_by = column, use with_ties = TRUE to keep ties
slice_max() - selects the n rows with the highest values in the order_by = column, use with_ties = TRUE to keep ties
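For example, a sketch of slice_max() on toy data (names illustrative), retrieving the two rows with the highest case counts within each group:

```r
library(dplyr)

df <- tibble::tibble(
  hospital = c("A", "A", "A", "B", "B", "B"),
  cases    = c(3, 9, 5, 8, 1, 4)
)

top2 <- df %>% 
  group_by(hospital) %>% 
  slice_max(cases, n = 2)   # the 2 highest case counts per hospital

top2
## hospital A keeps cases 9 and 5; hospital B keeps cases 8 and 4
```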

Filter on group size

The function add_count() adds a column n to the original data giving the number of rows in that row’s group.

Shown below for simplicity is a selection of two columns from the linelist data. add_count() is applied to hospital, so the values in column n reflect the number of rows in that row’s hospital group. Note how values in column n are repeated. In the example below, the column name n could be changed using name = within add_count().

linelist %>% 
  select(case_id, hospital) %>% 
  add_count(hospital) %>%          # add "number of rows admitted to same hospital as this row" 
  head(10)                         # show just the first 10 rows, for demo purposes
##    case_id                             hospital    n
## 1   a3c8b8                        Port Hospital 1762
## 2   d8a13d St. Mark's Maternity Hospital (SMMH)  422
## 3   5fe599                                Other  885
## 4   8689b7                              Missing 1469
## 5   11f8ea St. Mark's Maternity Hospital (SMMH)  422
## 6   893f25                    Military Hospital  896
## 7   be99c8                        Port Hospital 1762
## 8   d0523a                              Missing 1469
## 9   ce9c02                        Port Hospital 1762
## 10  275cc7                     Central Hospital  454

It then becomes easy to filter for rows of cases hospitalized at a "small" hospital - say, a hospital that admitted fewer than 500 patients:

linelist %>% 
  select(case_id, hospital) %>% 
  add_count(hospital) %>% 
  filter(n < 500)
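As mentioned above, the default column name n can be changed with the name = argument of add_count(). A quick sketch on toy data (names illustrative):

```r
library(dplyr)

df <- tibble::tibble(
  case_id  = 1:5,
  hospital = c("A", "A", "B", "A", "B")
)

counted <- df %>% 
  add_count(hospital, name = "n_in_hospital")

counted
## every row keeps its original columns, plus n_in_hospital (3 for "A", 2 for "B")
```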

Mutate on grouped data

To retain all columns and rows (not summarize) and add a new variable containing group statistics, use mutate() instead of summarise().

This is useful if you want group statistics in the original dataset with all other columns present - e.g. for calculations comparing one row to its group.

For example, the code below calculates the difference between a row's delay-to-admission and the mean delay for their hospital. The steps are:

  1. Group the data by hospital
  2. Use the column days_onset_hosp (delay to hospitalisation) to create a new column containing the mean delay at the hospital of that row
  3. Calculate the difference between the two columns
linelist %>% 
  # group data by hospital (no change to linelist yet)
  group_by(hospital) %>% 
  
  # new columns
  mutate(
    # mean days to admission per hospital (rounded to 1 decimal)
    group_delay_admit = round(mean(days_onset_hosp, na.rm=T), 1),
    
    # difference between row's delay and mean delay at their hospital (rounded to 1 decimal)
    diff_to_group     = round(days_onset_hosp - group_delay_admit, 1)) %>%
  
  # select certain rows only - for demonstration/viewing purposes
  select(case_id, hospital, days_onset_hosp, group_delay_admit, diff_to_group)
## # A tibble: 5,888 x 5
## # Groups:   hospital [6]
##    case_id hospital                             days_onset_hosp group_delay_admit diff_to_group
##    <chr>   <chr>                                          <dbl>             <dbl>         <dbl>
##  1 a3c8b8  Port Hospital                                      2               2             0  
##  2 d8a13d  St. Mark's Maternity Hospital (SMMH)               2               2             0  
##  3 5fe599  Other                                              2               2             0  
##  4 8689b7  Missing                                            1               2            -1  
##  5 11f8ea  St. Mark's Maternity Hospital (SMMH)               2               2             0  
##  6 893f25  Military Hospital                                  1               2.1          -1.1
##  7 be99c8  Port Hospital                                      1               2            -1  
##  8 d0523a  Missing                                            2               2             0  
##  9 ce9c02  Port Hospital                                      2               2             0  
## 10 275cc7  Central Hospital                                   1               1.9          -0.9
## # ... with 5,878 more rows

Select on grouped data

The verb select() works on grouped data, but the grouping columns are always included (even if not mentioned in select()). If you do not want these grouping columns, use ungroup() first.
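A brief sketch of this behaviour on toy data (names illustrative):

```r
library(dplyr)

df <- tibble::tibble(
  outcome = c("Death", "Recover"),
  age     = c(20, 30),
  temp    = c(38.5, 37.9)
)

grouped <- df %>% group_by(outcome)

# the grouping column outcome is retained, even though only age was selected
with_group_col <- grouped %>% select(age)

# ungroup first if you truly want only age
age_only <- grouped %>% ungroup() %>% select(age)
```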

Joining data

This page describes common “joins” and also probabilistic matching between dataframes.

Preparation

Load packages

pacman::p_load(
  rio,            # import/export
  here,           # relative filepaths
  tidyverse,      # data management/viz
  RecordLinkage,  # probabilistic matches
  fastLink        # probabilistic matches
)

Because traditional joins (non-probabilistic) can be very specific, requiring exact string matches, you may need to do cleaning on the datasets prior to the join (e.g. change spellings, change case to all lower or upper).

Load data

This page uses the linelist dataframe and other small example datasets as needed.

Datasets

In the joining examples, we’ll use the following datasets:

  1. A “miniature” version of the linelist, containing only the columns case_id, date_onset, and hospital, and only the first 10 rows
  2. A separate dataframe named hosp_info, which contains more details about each hospital
  3. Two separate small dataframes for the probabilistic matching section

“Miniature” linelist

Below is the miniature linelist, which contains only 10 rows and only the columns case_id, date_onset, and hospital.

linelist_mini <- linelist %>%                 # start with original linelist
  select(case_id, date_onset, hospital) %>%   # select columns
  head(10)                                    # only take the first 10 rows

Hospital information dataframe

Below is the separate dataframe with additional information about each hospital.

# Make the hospital information dataframe
hosp_info = data.frame(
  hosp_name     = c("central hospital", "military", "military", "port", "St. Mark's", "ignace", "sisters"),
  catchment_pop = c(1950280, 40500, 10000, 50280, 12000, 5000, 4200),
  level         = c("Tertiary", "Secondary", "Primary", "Secondary", "Secondary", "Primary", "Primary")
)

Pre-cleaning

Because traditional (non-probabilistic) joins are case-sensitive and require exact string matches, we will clean up the hosp_info dataset prior to the joins.

Identify differences

We need the values of the hosp_name column in the hosp_info dataframe to match the values of the hospital column in the linelist dataframe.

Here are the values in linelist_mini:

unique(linelist_mini$hospital)
## [1] "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)" "Other"                               
## [4] "Missing"                              "Military Hospital"                    "Central Hospital"

and here are the values in hosp_info:

unique(hosp_info$hosp_name)
## [1] "central hospital" "military"         "port"             "St. Mark's"       "ignace"           "sisters"

Align matching values

We begin by cleaning the values in hosp_name, recoding them with case_when() logic (read more about case_when() in the Cleaning data and core functions page). We correct the hospital names that exist in both dataframes, and leave the others as they are (TRUE ~ hosp_name).

CAUTION: Typically, one should create a new column (e.g. hosp_name_clean), but for ease of demonstration we show modification of the old column

hosp_info <- hosp_info %>% 
  mutate(
    hosp_name = case_when(
      hosp_name == "military"          ~ "Military Hospital",
      hosp_name == "port"              ~ "Port Hospital",
      hosp_name == "St. Mark's"        ~ "St. Mark's Maternity Hospital (SMMH)",
      hosp_name == "central hospital"  ~ "Central Hospital",
      TRUE                             ~ hosp_name
      )
    )

We now see that the hospital names that appear in both dataframes are aligned. There are some hospitals in hosp_info that are not present in linelist - we will deal with these later, in the join.

unique(hosp_info$hosp_name)
## [1] "Central Hospital"                     "Military Hospital"                    "Port Hospital"                       
## [4] "St. Mark's Maternity Hospital (SMMH)" "ignace"                               "sisters"

If you need to convert all values to UPPER or lower case, use these functions from stringr, as shown in the page on Characters and strings.

str_to_upper()
str_to_lower()
str_to_title()

dplyr joins

The dplyr package offers several different joins. dplyr is included in the tidyverse package. These join functions are described below, with simple use cases. Many thanks to https://github.com/gadenbuie for the moving images!

General syntax

General function structure

Any of these join commands can be run independently, like below.

An object is being created, or re-defined: dataframe 2 (df2) is being joined to dataframe 1 (df1), on the basis of matches between the column “ID” in df1 and the column “identifier” in df2. Because this example uses left_join(), any rows in df2 that do not match to a row in df1 will be dropped.

object <- left_join(df1, df2, by = c("ID" = "identifier"))

The join commands can also be run within a pipe chain. The first dataframe df1 is the dataframe that is being passed through the pipes. df2 is joined to it with the left_join() command. An example is shown below.

object <- df1 %>%
  left_join(df2, by = c("ID" = "identifier"))  # join df2 to df1

Join columns (by =)

You must specify the columns in each dataset whose values must match, using the argument by =. You have a few options:

  • Specify only one column name (by = "ID") - this only works if this exact column name is present in both dataframes!
  • Specify the different names (by = c("ID" = "Identifier")) - use this if the column names are different in the two dataframes
  • Specify multiple columns to match on (by = c("ID" = "Identifier", "date_onset" = "Date_of_Onset")) - this will require exact matches on multiple columns for rows to join.

CAUTION: Joins are case-sensitive! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.
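A small sketch of the by = syntax with differently named identifier columns, on toy dataframes (the names and values here are illustrative):

```r
library(dplyr)

df1 <- tibble::tibble(
  ID  = c("a", "b", "c"),
  age = c(10, 20, 30)
)

df2 <- tibble::tibble(
  identifier = c("a", "b", "d"),
  hospital   = c("Port", "Central", "Other")
)

joined <- left_join(df1, df2, by = c("ID" = "identifier"))

joined
## rows a and b gain a hospital value; c gets NA; d (no match in df1) is dropped
```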

Add columns: left & right joins

A left or right join is commonly used to add information to a dataframe - new information is added only to rows that already exist in the baseline (starting) dataframe.

These are common joins in epidemiological work - they are used to add information from one dataset into another.

The order of the dataframes is important.

  • In a left join, the first (left) dataframe listed is the baseline
  • In a right join, the second (right) dataframe listed is the baseline

All rows of the baseline dataframe are kept. Information in the secondary dataframe is joined to the baseline dataframe only if there is a match via the identifier column(s). In addition:

  • Rows in the secondary dataframe that do not match are dropped.
  • If many baseline rows match to one row in the secondary dataframe (many-to-one), the secondary information is added to each matching baseline row.
  • If a baseline row matches to multiple rows in the secondary dataframe (one-to-many), all combinations are given, meaning new rows may be added to your returned dataframe!

Example

Below is the output of a left_join() of hosp_info (secondary dataframe) into linelist_mini (baseline dataframe). Note the following:

  • Two new columns, catchment_pop and level, have been added on the right
  • All original rows of the baseline dataframe linelist_mini are kept
  • Any original rows of linelist_mini for “Military Hospital” are duplicated because it matched to two rows in the secondary dataframe, so both combinations are returned
  • The join identifier column of the secondary dataset (hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (hospital)
  • When a baseline row did not match to any secondary row (e.g. when hospital is “Other” or “Missing”), NA fills in the columns from the secondary dataframe
  • Rows in the secondary dataframe with no match to the baseline dataframe (“sisters” and “ignace”) were dropped
linelist_mini %>% 
  left_join(hosp_info, by = c("hospital" = "hosp_name"))

“Should I use a right join, or a left join?”
Most important is to ask “which dataframe should retain all of its rows?” - use this one as the baseline.

The two commands below achieve the same output - 10 rows of hosp_info joined into a linelist_mini baseline. However, the column order will differ based on whether hosp_info arrives from the right (in the left join) or arrives from the left (in the right join). The order of the rows may also shift consequently.

Also consider whether your use-case is within a pipe chain (%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.

# The two commands below achieve the same data, but with differently ordered rows and columns
left_join(linelist_mini, hosp_info, by = c("hospital" = "hosp_name"))
right_join(hosp_info, linelist_mini, by = c("hosp_name" = "hospital"))

Full join

A full join is the most inclusive of the joins - it returns all rows from both dataframes.

If there are any rows present in one dataframe and not the other (where no match was found), those rows are still included - the returned dataframe becomes longer, with NA values filling in the cells that have no match. Watch the number of rows and columns carefully, and troubleshoot case-sensitivity and exact string matches.

Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.

Example

Below is the output of a full_join() of hosp_info into linelist_mini. Note the following:

  • All baseline rows (linelist_mini) are kept
  • Any baseline rows for “Military Hospital” are duplicated because they match to two secondary rows and both combinations are returned
  • Only the identifier column from the baseline is kept (hospital)
  • NA fills in where baseline rows did not match to secondary rows (hospital was “Other” or “Missing”), or the opposite (where hosp_name was “ignace” or “sisters”)
linelist_mini %>% 
  full_join(hosp_info, by = c("hospital" = "hosp_name"))

Inner join

An inner join is the most restrictive of the joins - it returns only rows with matches across both dataframes.
This means that your original dataset may reduce in number of rows. Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.

Example

Below is the output of an inner_join() of linelist_mini (baseline) with hosp_info (secondary). Note the following:

  • Not all baseline rows are kept (rows where hospital is “Missing” or “Other” are removed because they had no match in the secondary dataframe)
  • Likewise, secondary rows where hosp_name is “sisters” or “ignace” are removed as they have no match in the baseline dataframe
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))
hosp_info %>% 
  inner_join(linelist_mini, by = c("hosp_name" = "hospital"))

Semi join

A semi join is a “filtering join” which uses another dataset not to add rows or columns, but to perform filtering.
A semi-join keeps all observations in dataframe 1 that have a match in dataframe 2 (but does not add new columns or duplicate any rows with multiple matches). Read more about filtering joins here.

The below code would return 0 rows, because the two dataframes are completely different - there are no rows that are in both.

hosp_info %>% 
  semi_join(linelist_mini, by = c("hosp_name" = "hospital"))

Anti join

The anti join is a “filtering join” that returns rows in dataframe 1 that do not have a match in dataframe 2.

Read more about filtering joins here.

Common scenarios for an anti-join include identifying records not present in another dataframe, troubleshooting spelling in a join (catching records that should have matched), and examining records that were excluded after another join.

As with left_join() and right_join(), the order of the dataframes matters: the returned rows come only from the baseline (first-listed) dataframe. Notice in the gif below that a row in the non-baseline dataframe (purple 4) is not returned even though it does not match.

Simple anti_join() example

For an example, let’s find the hosp_info hospitals that do not have any cases present in linelist_mini. We list hosp_info first, as the baseline dataframe. The two hospitals which are not present in linelist_mini are returned.

hosp_info %>% 
  anti_join(linelist_mini, by = c("hosp_name" = "hospital"))

anti_join() example 2

For another example, let us say we ran an inner_join() between linelist_mini and hosp_info. This returns only 8 of the original 11 linelist_mini records.

linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))

To review the 3 linelist_mini records that were excluded in the inner join, we can run an anti-join with linelist_mini as the baseline dataframe.

linelist_mini %>% 
  anti_join(hosp_info, by = c("hospital" = "hosp_name"))

To see the hosp_info records that were excluded in the inner join, we could also run an anti-join with hosp_info as the baseline dataframe.

Probabilistic matching

If you do not have a unique identifier common across the datasets to join on, consider using a probabilistic matching algorithm, which finds matches between records based on similarity (e.g. Jaro–Winkler string distance, or numeric distance). Below is a simple example using the fastLink package.

Load packages

pacman::p_load(
  tidyverse,      # data manipulation and visualization
  fastLink        # record matching
  )

Here are two small example datasets that we will use to demonstrate probabilistic matching (cases and results):

Here is the code used to make the datasets:

# make datasets

cases <- tribble(
  ~gender, ~first,      ~middle,     ~last,         ~yr, ~mon, ~day, ~district,
  "M",     "Amir",      NA,          "Khan",       1989,   11,   22, "River",
  "M",     "Anthony",   "B.",        "Smith",      1970,   09,   19, "River",
  "F",     "Marialisa", "Contreras", "Rodrigues",  1972,   04,   15, "River",
  "F",     "Elizabeth", "Casteel",   "Chase",      1954,   03,   03, "City",
  "M",     "Jose",      "Sanchez",   "Lopez",      1996,   01,   06, "City",
  "F",     "Cassidy",   "Jones",     "Davis",      1980,   07,   20, "City",
  "M",     "Michael",   "Murphy",    "O'Calaghan", 1969,   04,   12, "Rural",
  "M",     "Oliver",    "Laurent",   "De Bordow",  1971,   02,   04, "River",
  "F",     "Blessing",  NA,          "Adebayo",    1955,   02,   14, "Rural"
)

results <- tribble(
  ~gender, ~first,      ~middle,     ~last,           ~yr, ~mon, ~day, ~district, ~result,
  "M",     "Amir",      NA,          "Khan",         1989,   11,   22, "River",   "positive",
  "M",     "Tony",      "B",         "Smith",        1970,   09,   19, "River",   "positive",
  "F",     "Maria",     "Contreras", "Rodriguez",    1972,   04,   15, "Cty",     "negative",
  "F",     "Betty",     "Castel",    "Chase",        1954,   03,   30, "City",    "positive",
  "F",     "Andrea",    NA,          "Kumaraswamy",  2001,   01,   05, "Rural",   "positive",
  "F",     "Caroline",  NA,          "Wang",         1988,   12,   11, "Rural",   "negative",
  "F",     "Trang",     NA,          "Nguyen",       1981,   06,   10, "Rural",   "positive",
  "M",     "Olivier",   "Laurent",   "De Bordeaux",    NA,   NA,   NA, "River",   "positive",
  "M",     "Mike",      "Murphy",    "O'Callaghan",  1969,   04,   12, "Rural",   "negative",
  "F",     "Cassidy",   "Jones",     "Davis",        1980,   07,   02, "City",    "positive",
  "M",     "Mohammad",  NA,          "Ali",          1942,   01,   17, "City",    "negative",
  NA,      "Jose",      "Sanchez",   "Lopez",        1995,   01,   06, "City",    "negative",
  "M",     "Abubakar",  NA,          "Abullahi",     1960,   01,   01, "River",   "positive",
  "F",     "Maria",     "Salinas",   "Contreras",    1955,   03,   03, "River",   "positive"
)

The cases dataset has 9 records of patients who are awaiting test results.

The results dataset has 14 records and contains the column result, which we want to add to the records in cases based on probabilistic matching of records.

Probabilistic matching

The fastLink() function from the fastLink package can be used to apply a matching algorithm. Here is the basic information. You can read more detail by entering ?fastLink in your console.

  • Define the two dataframes for comparison to arguments dfA = and dfB =
  • In varnames = give all column names to be used for matching. They must all exist in both dfA and dfB.
  • In stringdist.match = give columns from those in varnames to be evaluated on string “distance”.
  • In numeric.match = give columns from those in varnames to be evaluated on numeric distance.
  • Missing values are ignored
  • By default, each row in either dataframe is matched to at most one row in the other dataframe. If you want to see all the evaluated matches, set dedupe.matches = FALSE. The deduplication is done using Winkler’s linear assignment solution.

Tip: split one date column into three separate numeric columns using day(), month(), and year() from lubridate package
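As a sketch of this tip, a hypothetical date of birth column dob (invented for illustration) could be split into the three numeric columns used in this example:

```r
pacman::p_load(dplyr, lubridate)

# hypothetical data with one date of birth column
patients <- tibble(dob = as.Date(c("1989-11-22", "1970-09-19")))

# split the date into three numeric columns, suitable for numeric matching
patients <- patients %>%
  mutate(
    yr  = year(dob),
    mon = month(dob),
    day = day(dob))
```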

The default match threshold is 0.94 (threshold.match =), but you can adjust it higher or lower. When setting the threshold, consider that a higher threshold could yield more false negatives (rows that should match but do not), while a lower threshold could yield more false-positive matches.

Below, the data are matched on string distance across the name and district columns, and on numeric distance for year, month, and day of birth. A match threshold of 95% probability is set.

fl_output <- fastLink::fastLink(
  dfA = cases,
  dfB = results,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day"),
  threshold.match = 0.95)
## 
## ==================== 
## fastLink(): Fast Probabilistic Record Linkage
## ==================== 
## 
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## Calculating matches for each variable.
## Getting counts for parameter estimation.
##     Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
##     Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Deduping the estimated matches.
## Getting the match patterns for each estimated match.

Review matches

We defined the object returned from fastLink() as fl_output. It is of class list, and it actually contains several dataframes within it, detailing the results of the matching. One of these dataframes is matches, which contains the most likely matches across cases and results. You can access this “matches” dataframe with fl_output$matches. Below, it is saved as my_matches for ease of accessing later.

When my_matches is printed, you see two column vectors: the pairs of row numbers/indices (also called “rownames”) in cases (“inds.a”) and in results (“inds.b”) representing the best matches. If a row number from a dataframe is missing, then no match was found in the other dataframe at the specified match threshold.

# print matches
my_matches <- fl_output$matches
my_matches
##   inds.a inds.b
## 1      1      1
## 2      2      2
## 3      3      3
## 4      4      4
## 5      8      8
## 6      7      9
## 7      6     10
## 8      5     12

Things to note:

  • Matches occurred despite slight differences in name spelling and dates of birth:
    • “Tony” matched to “Anthony”
    • “Maria” matched to “Marialisa”
    • “Betty” matched to “Elizabeth”
    • “Olivier Laurent De Bordeaux” matched to “Oliver Laurent De Bordow” (missing date of birth ignored)
  • One row from cases (for “Blessing Adebayo”, row 9) had no good match in results, so it is not present in my_matches.

Join based on the probabilistic matches

To use these matches to join results to cases, one strategy is:

  1. Use left_join() to join my_matches to cases (matching rownames in cases to “inds.a” in my_matches)
  2. Then use another left_join() to join results to cases (matching the newly-acquired “inds.b” in cases to rownames in results)

Before the joins, we should clean the three datasets:

  • Both dfA and dfB should have their row numbers (“rowname”) converted to a proper column
  • Both columns in my_matches should be converted to class character, so they can be joined to the character rownames
# Clean data prior to joining
#############################

# convert cases rownames to a column 
cases_clean <- cases %>% rownames_to_column()

# convert results rownames to a column
results_clean <- results %>% rownames_to_column()  

# convert all columns in matches dataset to character, so they can be joined to the rownames
matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))



# Join matches to dfA, then add dfB
###################################
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))

# column(s) from dfB are added 
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

As performed using the code above, the resulting dataframe complete will contain all columns from both cases and results. Many will be appended with suffixes “.x” and “.y”, because the column names would otherwise be duplicated.

Alternatively, to achieve only the “original” 9 records in cases with the new column(s) from results, use select() on results before the joins, so that it contains only rownames and the columns that you want to add to cases (e.g. the column result).

cases_clean <- cases %>% rownames_to_column()

results_clean <- results %>%
  rownames_to_column() %>% 
  select(rowname, result)    # select only certain columns 

matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))

# joins
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

If you want to subset either dataset to only the rows that matched, you can use the code below:

cases_matched <- cases[my_matches$inds.a,]  # Rows in cases that matched to a row in results
results_matched <- results[my_matches$inds.b,]  # Rows in results that matched to a row in cases

Or, to see only the rows that did not match:

cases_not_matched <- cases[!rownames(cases) %in% my_matches$inds.a,]  # Rows in cases that did NOT match to a row in results
results_not_matched <- results[!rownames(results) %in% my_matches$inds.b,]  # Rows in results that did NOT match to a row in cases

Probabilistic deduplication

Probabilistic matching can be used to deduplicate a dataset as well. See the page on deduplication for other methods of deduplication.

Here we began with the cases dataset, but are now calling it cases_dup, as it has 2 additional rows that could be duplicates of previous rows: See “Tony” with “Anthony”, and “Marialisa Rodrigues” with “Maria Rodriguez”.
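The code creating cases_dup is not shown in this excerpt; it could be constructed like this (a sketch - the values of the two added rows are assumed to mirror the duplicates described above):

```r
pacman::p_load(dplyr)

# a sketch of cases_dup: cases plus two rows that may duplicate existing rows (values assumed)
cases_dup <- cases %>%
  bind_rows(tribble(
    ~gender, ~first,  ~middle, ~last,        ~yr, ~mon, ~day, ~district,
    "M",     "Tony",  "B.",    "Smith",     1970,    9,   19, "River",
    "F",     "Maria", NA,      "Rodriguez", 1972,    4,   15, "River"))
```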

Run the same fastLink() command as before, but compare the cases_dup dataframe to itself. When the two dataframes provided are identical, the function assumes you want to de-duplicate.

## Run fastLink on the same dataset
dedupe_output <- fastLink(
  dfA = cases_dup,
  dfB = cases_dup,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day")
)
## 
## ==================== 
## fastLink(): Fast Probabilistic Record Linkage
## ==================== 
## 
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## dfA and dfB are identical, assuming deduplication of a single data set.
## Setting return.all to FALSE.
## 
## Calculating matches for each variable.
## Getting counts for parameter estimation.
##     Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
##     Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Calculating the posterior for each pair of matched observations.
## Getting the match patterns for each estimated match.

Now, you can review the potential duplicates with getMatches(). Provide the dataframe as both dfA = and dfB =, and provide the output of the fastLink() function as fl.out =. This fl.out = object must be of class fastLink.dedupe - in other words, the result of a fastLink() call.

## Run getMatches()
cases_dedupe <- getMatches(
  dfA = cases_dup,
  dfB = cases_dup,
  fl.out = dedupe_output)

See the right-most column, which indicates the duplicate IDs - the final two rows are identified as being likely duplicates of rows 2 and 3.

To return the row numbers of rows which are likely duplicates, you can count the number of rows per unique value in the dedupe.ids column, and then filter to keep only those with more than one row. In this case this leaves rows 2 and 3.

cases_dedupe %>% 
  count(dedupe.ids) %>% 
  filter(n > 1)
##   dedupe.ids n
## 1          2 2
## 2          3 2

To inspect the whole rows of the likely duplicates, put the row number in this command:

# displays row 2 and all likely duplicates of it
cases_dedupe[cases_dedupe$dedupe.ids == 2,]   
##    gender   first middle  last   yr mon day district dedupe.ids
## 2       M Anthony     B. Smith 1970   9  19    River          2
## 10      M    Tony     B. Smith 1970   9  19    River          2

Resources

The dplyr page on joins

See this vignette on fastLink at the package’s Github page

Publication describing the methodology of fastLink

Publication describing RecordLinkage package

Characters and strings

This page demonstrates use of the stringr package to evaluate and manage character values (“strings”).

  1. Evaluate and extract by position - str_length(), str_sub(), word()
  2. Combine, order, arrange - str_c(), str_glue(), str_order()
  3. Modify and replace - str_sub(), str_replace_all()
  4. Adjust length - str_pad(), str_trunc(), str_wrap()
  5. Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
  6. Search for patterns - str_detect(), str_subset(), str_match()
  7. Regular expressions (regex)

For ease of display, most examples are shown acting on a short defined character vector; however, they can easily be adapted to a column within a dataframe.

Much of this page is adapted from this online vignette

Preparation

Install or load the stringr package.

# install/load the stringr package
pacman::p_load(
  stringr,    # many functions for handling strings
  tidyverse,  # for optional data manipulation
  tools)      # alternative for converting to title case

Handle by position

Extract by character position

Use str_sub() to return only a part of a string. The function takes three main arguments:

  1. the character vector(s)
  2. start position
  3. end position

A few notes on position numbers:

  • If a position number is positive, the position is counted starting from the left end of the string.
  • If a position number is negative, it is counted starting from the right end of the string.
  • Position numbers are inclusive.
  • Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string “pneumonia”:

# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)
## [1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
## [1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
## [1] "onia"
# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)
## [1] "moni"
# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)
## [1] "umonia"

Extract by word position

To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
## [1] "I just got"       "My stomach hurts" "Severe ear pain"

Replace by character position

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

word
## [1] "pnXXmonia"

An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.

words <- c("pneumonia", "tubercolosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words
## [1] "pnXXmonia"    "tuXXrcolosis" "HIXX"

Evaluate length

str_length("abc")
## [1] 3

Alternatively, use nchar() from base R.
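nchar() works the same way on a character vector:

```r
nchar("abc")
## [1] 3
```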

Unite, split, and arrange

This section covers:

  • Using str_c(), str_glue(), and unite() to combine strings
  • Using str_order() to arrange strings
  • Using str_split() and separate() to split strings

Combine strings

To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr.

str_c("String1", "String2", "String3")
## [1] "String1String2String3"

The argument sep = inserts characters between each input (e.g. a comma or newline "\n")

str_c("String1", "String2", "String3", sep = ", ")
## [1] "String1, String2, String3"

The argument collapse = is relevant if producing multiple combined elements in the output. The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts.

sep displays between the respective string inputs, while collapse displays between the elements produced.

In this example:

  • The sep value goes between each first and last name
  • The collapse value goes between each person
first_names <- c("abdul", "fahruk", "janice") 
last_names  <- c("hussein", "akinleye", "musa")

# sep displays between the respective input strings, while collapse displays between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")
## [1] "abdul hussein;  fahruk akinleye;  janice musa"

When printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:

# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
## abdul hussein;
## fahruk akinleye;
## janice musa

Dynamic strings

Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.

  • All content goes between quotation marks str_glue("")
  • Any dynamic code or calls of defined values are within curly brackets {} within the parentheses. There can be many curly brackets.
  • To display quotes within the outer quotation marks, use single quotes (e.g. when providing date format)
  • You can use \n within the quotes to force a new line
  • Use format() to adjust date display, and Sys.Date() to retrieve the current date

A simple example, of a dynamic plot caption:

str_glue("The linelist is current to {format(Sys.Date(), '%d %b %Y')} and includes {nrow(linelist)} cases.")
## The linelist is current to 10 Mar 2021 and includes 5888 cases.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the codes are long.

str_glue("Data source is the confirmed case linelist as of {current_date}.\nThe last case was reported hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
## Data source is the confirmed case linelist as of 10 Mar 2021.
## The last case was reported hospitalized on 30 Apr 2015.
## 0 cases are missing date of onset and not shown

Pulling from a dataframe

Sometimes, it is useful to pull data from a dataframe and have it pasted together in sequence. Below is an example using this dataset to make a summary output of jurisdictions and their new and total cases:

# make case table
case_table <- data.frame(
  zone       = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103))

Option 1:

Use str_c() with the dataframe and column names. Provide sep and collapse arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")
## [1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

Add text “New Cases:” to the beginning of the summary by wrapping with a separate str_c() (if “New Cases:” was within the original str_c() it would appear multiple times).

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))
## [1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

Option 2:

You can achieve a similar result with str_glue(), with newlines added automatically:

str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)")
## Zone 1: 3 new cases (40 total cases)
## Zone 2: 0 new cases (4 total cases)
## Zone 3: 7 new cases (25 total cases)
## Zone 4: 0 new cases (10 total cases)
## Zone 5: 15 new cases (103 total cases)

To use str_glue() but have more control (e.g. to use double newlines), wrap it within str_c() and adjust the collapse value. You may need to print using cat() to correctly print the newlines.

case_summary <- str_c(str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)"), collapse = "\n\n")

cat(case_summary) # print
## Zone 1: 3 new cases (40 total cases)
## 
## Zone 2: 0 new cases (4 total cases)
## 
## Zone 3: 7 new cases (25 total cases)
## 
## Zone 4: 0 new cases (10 total cases)
## 
## Zone 5: 15 new cases (103 total cases)

Unite columns

Within a dataframe, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().

Provide the name of the new united column. Then provide the names of the columns you wish to unite.

  • By default the separator used in the united column is underscore _, but this can be changed with the sep argument.
  • remove = removes the input columns from the data frame (TRUE by default)
  • na.rm = removes missing values while uniting (FALSE by default)

Below, we unite the three symptom columns in this dataframe.

df_split %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
##   case_ID                all_symptoms outcome
## 1       1     jaundice, fever, chills Success
## 2       2        chills, aches, pains Failure
## 3       3                       fever Failure
## 4       4         vomiting, diarrhoea Success
## 5       5 bleeding, from, gums, fever Success
## 6       6      rapid, pulse, headache Success

Split

To split a string based on a pattern, use str_split(). It evaluates the strings and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string and splits it into three. By default it returns a list with one element (a character vector) for each string provided. If simplify = TRUE it returns a character matrix.

One string is provided, and returned is a list with one element, which is a character vector with three values

str_split(string = "jaundice, fever, chills",
          pattern = ",")
## [[1]]
## [1] "jaundice" " fever"   " chills"

You can assign this as a named object, and access the nth symptom. To access a specific symptom you can use syntax like this: the_split_return_object[[1]][2], which would access the second symptom from the first evaluated string (“fever”). See the R Basics page for more detail on accessing elements.

pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list
## [1] " fever"

If multiple strings are evaluated, there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
## [[1]]
## [1] "jaundice" " fever"   " chills" 
## 
## [[2]]
## [1] "chills" " aches" " pains"
## 
## [[3]]
## [1] "fever"
## 
## [[4]]
## [1] "vomiting"   " diarrhoea"
## 
## [[5]]
## [1] "bleeding from gums" " fever"            
## 
## [[6]]
## [1] "rapid pulse" " headache"

To return a “character matrix” instead, which may be useful if creating dataframe columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = TRUE)
##      [,1]                 [,2]         [,3]     
## [1,] "jaundice"           " fever"     " chills"
## [2,] "chills"             " aches"     " pains" 
## [3,] "fever"              ""           ""       
## [4,] "vomiting"           " diarrhoea" ""       
## [5,] "bleeding from gums" " fever"     ""       
## [6,] "rapid pulse"        " headache"  ""

You can also adjust the number of splits to create with the n = argument. For example, this restricts the number of splits (from the left side) to 2 splits. The further commas remain within the second split.

str_split(symptoms, ",", simplify = TRUE, n = 2)
##      [,1]                 [,2]            
## [1,] "jaundice"           " fever, chills"
## [2,] "chills"             " aches, pains" 
## [3,] "fever"              ""              
## [4,] "vomiting"           " diarrhoea"    
## [5,] "bleeding from gums" " fever"        
## [6,] "rapid pulse"        " headache"

Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).

str_split_fixed(symptoms, ",", n = 2)

Split columns

Use separate() from tidyr within a dataframe, to split one character column into other columns.

If we have a simple dataframe df consisting of a case ID column, one character column with symptoms, and one outcome column:
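For reference, such a dataframe can be created as follows (a sketch - the symptoms values mirror the vector defined earlier, and the outcome values follow the outputs printed in this section):

```r
# create example dataframe of cases, their symptoms, and outcomes
df <- data.frame(
  case_ID  = 1:6,
  symptoms = c("jaundice, fever, chills",
               "chills, aches, pains",
               "fever",
               "vomiting, diarrhoea",
               "bleeding from gums, fever",
               "rapid pulse, headache"),
  outcome  = c("Success", "Failure", "Failure", "Success", "Success", "Success"))
```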

First, provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.

  • sep = the separator, can be a character, or a number (interpreted as the character position to split at).
  • remove = TRUE by default, removes the input column (set remove = FALSE to keep it)
  • convert = FALSE by default; set convert = TRUE to convert the new columns to appropriate classes (e.g. string “NA”s become true NA)
  • extra = this controls what happens if there are more values created by the separation than new columns named.
    • extra = "warn" means you will see a warning but it will drop excess values (the default)
    • extra = "drop" means the excess values will be dropped with no warning
    • extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data

An example with extra = "merge" - no data is lost and third symptoms are combined into the second new named column:

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1          sym_2 outcome
## 1       1           jaundice  fever, chills Success
## 2       2             chills   aches, pains Failure
## 3       3              fever           <NA> Failure
## 4       4           vomiting      diarrhoea Success
## 5       5 bleeding from gums          fever Success
## 6       6        rapid pulse       headache Success

When the default extra = "warn" is used below, a warning is given and the third symptoms are lost:

# third symptoms are lost
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",")
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1      sym_2 outcome
## 1       1           jaundice      fever Success
## 2       2             chills      aches Failure
## 3       3              fever       <NA> Failure
## 4       4           vomiting  diarrhoea Success
## 5       5 bleeding from gums      fever Success
## 6       6        rapid pulse   headache Success

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.
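As noted above, sep = can also be given as a number, interpreted as the character position to split at. A small sketch (the codes dataframe is hypothetical):

```r
# hypothetical column of combined codes (letter + number)
codes <- data.frame(combined = c("A12", "B34", "C56"))

# split after the first character
codes %>% 
  separate(combined, into = c("letter", "number"), sep = 1)
##   letter number
## 1      A     12
## 2      B     34
## 3      C     56
```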

Arrange

Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)
## [1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
## [1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
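For example, in the Swedish alphabet “ö” is the last letter, so sorting with locale = "sv" places it after “z” (a sketch, assuming the relevant stringi locale data are available):

```r
# sort using the Swedish alphabet, where "ö" comes after "z"
str_sort(c("ö", "o", "z"), locale = "sv")
## [1] "o" "z" "ö"
```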

To arrange strings in order of their value in another column, use arrange() like this: TO DO

base R functions

It is common to see base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax differs - in the code each part is separated by a comma. The parts are either text (in quotes) or pre-defined code objects. For example:

n_beds <- 10
n_masks <- 20

paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks.")
## [1] "Regional hospital needs 10 beds and 20 masks."

sep and collapse arguments can be adjusted. By default, sep is a space, unless using paste0() where there is no space between parts.
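A brief illustration of these arguments (facilities is a hypothetical vector):

```r
facilities <- c("Central", "Regional", "District")

# paste0() inserts no space between parts
paste0(facilities, " Hospital")
## [1] "Central Hospital"  "Regional Hospital" "District Hospital"

# collapse = combines a vector into one single string
paste(facilities, collapse = ", ")
## [1] "Central, Regional, District"
```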

Adjust length

Pad

Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
## [1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
## [1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 
## [1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")

Truncate

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
## [1] "Symp...ing"

Standardize length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then a very short value is padded to achieve length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
## [1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
## [1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) from the sides of a string. Add "right", "left", or "both" (the default) to the command to specify which side(s) to trim (e.g. str_trim(x, "right")).

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces on right side only
str_trim(IDs, "right")
## [1] "provA_1852" "provA_2345" "provA_9460"

Remove repeated whitespace within

Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 
## [1] "Pt requires IV saline"

Enter ?str_trim, ?str_pad in your R console to see further details.

Wrap into paragraphs

Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)
## [1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))
## Symptom onset 1/4/2020 vomiting chills
## fever. Pt saw traditional healer in
## home village on 2/4/2020. On 5/4/2020
## pt symptoms worsened and was admitted
## to Lumta clinic. Sample was taken and pt
## was transported to regional hospital on
## 6/4/2020. Pt died at regional hospital
## on 7/4/2020.

Change case

Often one must alter the case/capitalization of string values, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), as shown below:

str_to_upper("California")
## [1] "CALIFORNIA"
str_to_lower("California")
## [1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case

Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")
## [1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).

tools::toTitleCase("This is the US state of california")
## [1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")
## [1] "The patient must be transported"

Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect a pattern

Use str_detect() as below to detect presence/absence of a pattern within a string. First list the string or vector to search in, and then the pattern to look for. Note that by default the search is case sensitive!

str_detect("primary school teacher", "teach")
## [1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect("primary school teacher", "teach", negate = TRUE)
## [1] FALSE

To ignore case/capitalization, wrap the pattern within regex() and add the argument ignore_case = TRUE.

str_detect("Teacher", regex("teach", ignore_case = T))
## [1] TRUE

When str_detect() is applied to a character vector/column, it will return a TRUE/FALSE for each of the values in the vector.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count these, apply sum() to the output. This counts the number of TRUE values.

sum(str_detect(occupations, "teach"))
## [1] 1

To search inclusive of multiple terms, include them separated by OR bars (|) within the pattern, as shown below:

sum(str_detect(occupations, "teach|professor|tutor"))
## [1] 3

If you need to make a long list of search terms, you can combine them using str_c() with sep = "|", define this as a character object, and reference it later more succinctly. The example below includes possible occupation search terms for frontline medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                               "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                               "cna", "pa", "physician assistant", "mental health",
                               "emergency department technician", "resp therapist", "respiratory",
                                "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                               "rehab", "activity", "elderly", "subacute", "sub acute",
                                "clinic", "post acute", "therapist", "extended care",
                                "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
## [1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(occupations, occupation_med_frontline))
## [1] 2

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
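To illustrate the difference between the two:

```r
note <- "fever, fever, chills"

sub("fever", "cough", note)     # replaces only the first instance
## [1] "cough, fever, chills"

gsub("fever", "cough", note)    # replaces all instances
## [1] "cough, cough, chills"
```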

Convert commas to periods

Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain, where a comma is used as the decimal separator.

The inner gsub(), which acts first on lengths, converts any periods (thousands separators) to "" (nothing). The period character "." has to be “escaped” with two backslashes (\\.) to actually signify a period, because in regex "." means “any character”. Then, the result (with only commas remaining) is passed to the outer gsub(), in which commas are replaced by periods.

lengths <- c("2.454,56", "1,2", "6.096,5")

as.numeric(gsub(pattern = ",",                # find commas     
                replacement = ".",            # replace with periods
                x = gsub("\\.", "", lengths)  # vector with other periods removed (periods escaped)
                )
           )                                  # convert outcome to numeric
## [1] 2454.56    1.20 6096.50

Replace all

Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated, then the pattern to be replaced, and then the replacement value. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(outcome, "dead", "deceased")
## [1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

To replace NA values with a string, use str_replace_na(). Note that str_replace() (without “_all”) replaces only the first instance of the pattern within each evaluated string.
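A quick example of str_replace_na() (status is a hypothetical vector):

```r
status <- c("dead", NA, "recovered")

str_replace_na(status, "unknown")   # NA values become the string "unknown"
## [1] "dead"      "unknown"   "recovered"
```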

Detect within logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let’s say the occupations are a column in the linelist called occupations. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic, using str_detect() with negate = TRUE:

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term AND
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    # Must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = T)                          ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                            ~ "Not an educator"))

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")
##      start end
## [1,]     5   6

Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern
##      start end
## [1,]     6   6
## [2,]     3   3
## [3,]     1   1
## [4,]     4   4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
## [[1]]
##      start end
## [1,]     6   6
## 
## [[2]]
##      start end
## [1,]     3   3
## 
## [[3]]
##      start end
## [1,]     1   1
## [2,]     4   4
## 
## [[4]]
##      start end
## [1,]     4   4

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (see previous tab) for either “teach”, “prof”, or “tutor”.

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")
## [[1]]
## character(0)
## 
## [[2]]
## [1] "prof"
## 
## [[3]]
## [1] "teach" "tutor"
## 
## [[4]]
## [1] "tutor"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.

str_extract(occupations, "teach|prof|tutor")
##  [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA      NA

Subset and count


Aligned functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")
## [1] "university professor"           "primary school teacher & tutor" "tutor"

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
##  [1] 0 1 2 1 0 0 0 0 0 0

Regex groups

Groups within strings

str_match() TBD

Regex and special characters

Regular expressions, or “regex”, are a concise language for describing patterns in strings.

Much of this section is adapted from this tutorial and this cheatsheet

Special characters

Backslash \ as escape

The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.

Special characters

Special character Represents
"\\" backslash
"\n" a new line (newline)
"\"" double-quote within double quotes
'\'' single-quote within single quotes
"\`" grave accent
"\r" carriage return
"\t" tab
"\v" vertical tab
"\b" backspace

Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).

Regular expressions (regex)

If you are not familiar with them, regular expressions can look like an alien language.

A regular expression is applied to extract specific patterns from unstructured text - for example medical notes, chief complaint, patient history, or other free text columns in a dataset.

There are four basic tools one can use to create a basic regular expression:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Groups

Character sets

Character sets are a way of listing options for a character match, within brackets. A match will be triggered if any of the characters within the brackets are found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

Character set Matches for
"[A-Z]" any single capital letter
"[a-z]" any single lowercase letter
"[0-9]" any digit
[:alnum:] any alphanumeric character
[:digit:] any numeric digit
[:alpha:] any letter (upper or lowercase)
[:upper:] any uppercase letter
[:lower:] any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
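For example, extracting all vowels and digits from a string with a combined character set:

```r
# extract every character that is a lowercase vowel or a digit
str_extract_all("Bed 12, Ward C", "[aeiou0-9]")
## [[1]]
## [1] "e" "1" "2" "a"
```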

Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character Represents
"\\s" a single space
"\\w" any single alphanumeric character (A-Z, a-z, or 0-9)
"\\d" any single numeric digit (0-9)

Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to designate how many consecutive characters (or character-set matches) to allow for the match.

Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example,

  • "A{2}" will return instances of two capital A letters.
  • "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
  • "A{2,}" will return instances of two or more capital A letters.
  • "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
  • "A*" (with an asterisk) will return zero or more matches (useful if you are not sure the pattern is present)

Using the + plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (runs of alpha characters): "[A-Za-z]+"

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")
## [[1]]
## [1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.

str_extract_all(test, "A{2,4}")
## [[1]]
## [1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")
## [[1]]
## [1] "A"    "AA"   "AAA"  "AAAA"

Relative position

These “lookaround” expressions set requirements for what must precede or follow a pattern, without including those characters in the match itself. For example, the pattern "(?<=\\.)\\s(?=[A-Z])" matches a space that is preceded by a period and followed by a capital letter - the gap between two sentences.

Position statement Matches to
"(?<=b)a" “a” that is preceded by a “b”
"(?<!b)a" “a” that is NOT preceded by a “b”
"a(?=b)" “a” that is followed by a “b”
"a(?!b)" “a” that is NOT followed by a “b”

Groups

Capturing groups in your regular expression is a way to have a more organized output upon extraction.
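A minimal sketch using str_match() - the parentheses define the groups, which are returned as separate matrix columns alongside the full match (the dates vector is hypothetical):

```r
dates <- c("onset 6/12/2005", "discharge 7/1/2006")

# full match in column 1, then one column per capturing group
str_match(dates, "(\\d+)/(\\d+)/(\\d+)")
##      [,1]        [,2] [,3] [,4]  
## [1,] "6/12/2005" "6"  "12" "2005"
## [2,] "7/1/2006"  "7"  "1"  "2006"
```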

Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches to all words (any character until hitting non-character such as a space):

str_extract_all(pt_note, "[A-Za-z]+")
## [[1]]
##  [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"    "emergency"   "ward"        "at"          "on"         
## [10] "Patient"     "presented"   "with"        "radiating"   "abdominal"   "pain"        "from"        "LR"          "quadrant"   
## [19] "Patient"     "skin"        "was"         "pale"        "cool"        "and"         "clammy"      "Patient"     "temperature"
## [28] "was"         "degrees"     "farinheit"   "Patient"     "pulse"       "rate"        "was"         "bpm"         "and"        
## [37] "thready"     "Respiratory" "rate"        "was"         "per"         "minute"

The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")
## [[1]]
##  [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
Note what happens when a special character is not escaped: in regex, "." means “any character”, so splitting on an unescaped period splits at every single character and returns only empty strings:

str_split(pt_note, ".")
## [[1]]
##   [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [44] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [87] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [130] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [173] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [216] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [259] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [302] "" "" "" "" "" "" "" ""

This expression attempts to match phrases consisting of a capitalized word, a second word, a 1-2 digit number, and one or two further words. The pattern reads in English as: “A capital letter followed by some lowercase letters, a space, a word, a space, a 1-2 digit number, a space, a word, and optionally a space and another word”. No phrase in this note follows that exact structure, so an empty match (character(0)) is returned.

str_extract_all(pt_note, "[A-Z][a-z]+\\s\\w+\\s\\d{1,2}\\s\\w+\\s*\\w*")
## [[1]]
## character(0)

You can view a useful list of regex expressions and tips on page 2 of this cheatsheet

Also see this tutorial.

Resources

A reference sheet for stringr functions can be found here

A vignette on stringr can be found here

De-duplication

Overview

This page covers the following subjects:

  1. Identifying and removing duplicate rows
  2. “Slicing” and keeping only certain rows (min, max, random…), also from each group
  3. “Rolling-up”, or combining values from multiple rows into one

Preparation

Load packages

pacman::p_load(
  tidyverse,   # deduplication, grouping, and slicing functions
  janitor,     # function for reviewing duplicates
  stringr)      # for string searches, can be used in "rolling-up" values

Example dataset

For demonstration, we will use the fake dataset below. It is a record of COVID-19 phone encounters, including with contacts and with cases.

  • The first two records are 100% complete duplicates including duplicate recordID (computer glitch)
  • The second two rows are duplicates, in all columns except for recordID
  • Several people had multiple phone encounters, at various dates/times and as contacts or cases
  • At each encounter, the person was asked if they had ever had symptoms, and some of this information is missing.

Here is the code to create the dataset:

obs <- data.frame(
  recordID  = c(1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
  personID  = c(1,1,2,2,3,2,4,5,6,7,2,1,3,3,4,5,5,7,8),
  name      = c("adam", "adam", "amrish", "amrish", "mariah", "amrish", "nikhil", "brian", "smita", "raquel", "amrish",
                "adam", "mariah", "mariah", "nikhil", "brian", "brian", "raquel", "natalie"),
  date      = c("1/1/2020", "1/1/2020", "2/1/2020", "2/1/2020", "5/1/2020", "5/1/2020", "5/1/2020", "5/1/2020", "5/1/2020","5/1/2020", "2/1/2020",
                "5/1/2020", "6/1/2020", "6/1/2020", "6/1/2020", "6/1/2020", "7/1/2020", "7/1/2020", "7/1/2020"),
  time      = c("09:00", "09:00", "14:20", "14:20", "12:00", "16:10", "13:01", "15:20", "14:20", "12:30", "10:24",
                "09:40", "07:25", "08:32", "15:36", "15:31", "07:59", "11:13", "17:12"),
  encounter = c(1,1,1,1,1,3,1,1,1,1,2,
                2,2,3,2,2,3,2,1),
  purpose   = c("contact", "contact", "contact", "contact", "case", "case", "contact", "contact", "contact", "contact", "contact",
                "case", "contact", "contact", "contact", "contact", "case", "contact", "case"),
  symptoms_ever = c(NA, NA, "No", "No", "No", "Yes", "Yes", "No", "Yes", NA, "Yes",
                    "No", "No", "No", "Yes", "Yes", "No","No", "No"))

And here is the dataset:

Deduplication

This tab uses the dataset from the Preparation tab to describe how to review and remove duplicate rows in a dataframe. It also shows how to handle duplicate elements in a vector.

Examine duplicate rows

To quickly review rows that have duplicates, you can use get_dupes() from the janitor package. By default, all columns are considered when duplicates are evaluated - rows returned are 100% duplicates considering the values in all columns.

In the obs dataframe, the first two rows are 100% duplicates - they have the same value in every column (including the recordID column, which is supposed to be unique - it must be some computer glitch). The returned dataframe automatically includes a new column dupe_count, showing the number of rows with that combination of duplicate values.

# 100% duplicates across all columns
obs %>% 
  janitor::get_dupes()

However, if we choose to ignore recordID, the 3rd and 4th rows are also duplicates of each other. That is, they have the same values in all columns except for recordID. You can specify columns to be ignored in the function using a - minus symbol.

# Duplicates when column recordID is not considered
obs %>% 
  janitor::get_dupes(-recordID)         # if multiple columns, wrap them in c()

You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.

Scroll left for more rows

# duplicates based on name and purpose columns ONLY
obs %>% 
  janitor::get_dupes(name, purpose)

See ?get_dupes for more details, or see this online reference

Keep only unique rows

To keep only unique rows of a dataframe, use distinct() from dplyr. Rows that are duplicates are removed such that only the first of them is kept - “first” meaning the topmost in row order (top-to-bottom). In the example below, the complete-duplicate row for “adam” has been removed (n is now 18, not 19 rows).

Scroll to the left to see the entire dataframe

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(across(-recordID), # reduces dataframe to only unique rows (keeps first one of any duplicates)
           .keep_all = TRUE) 

# if outside pipes, include the data as first argument 
# distinct(obs)

CAUTION: If using distinct() on grouped data, the function will apply to each group.

Deduplicate based on specific columns

You can also specify columns to be the basis for de-duplication. In this way, the de-duplication only applies to rows that are duplicates within the specified columns. Unless specified with .keep_all = TRUE, all columns not mentioned will be dropped.

In the example below, the de-duplication only applies to rows that have identical values for name and purpose columns. Thus, “brian” has only 2 rows instead of 3 - his first “contact” encounter and his only “case” encounter. To adjust so that brian’s latest encounter of each purpose is kept, see the tab on Slicing within groups.

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(name, purpose, .keep_all = TRUE) %>%  # keep rows unique by name and purpose, retain all columns
  arrange(name)                                  # arrange for easier viewing

Duplicate elements in a vector

The function duplicated() from base R will evaluate a vector (column) and return a logical vector of the same length (TRUE/FALSE). The first time a value appears, it will return FALSE (not a duplicate), and subsequent times that value appears it will return TRUE. Note how NA is treated the same as any other value.

x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)
duplicated(x)
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

To return only the duplicated elements, you can use brackets to subset the original vector:

x[duplicated(x)]
## [1]  1 NA  4  4  1  2

To return only the unique elements, use unique() from base R. To remove NAs from the output, nest na.omit() within unique().

unique(x)           # alternatively, use x[!duplicated(x)]
## [1]  1  2 NA  4  5
unique(na.omit(x))  # remove NAs 
## [1] 1 2 4 5

with base R

To return duplicate rows

In base R, you can also see which rows are 100% duplicates in a dataframe df with the command duplicated(df) (returns a logical vector of the rows).

Thus, you can also use the base subset [ ] on the dataframe to see the duplicated rows with df[duplicated(df),] (don’t forget the comma, meaning that you want to see all columns!).
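
As a minimal runnable sketch of the two commands above (using a small toy data frame for illustration, not the obs data):

```r
# toy data frame for illustration (not the handbook's obs data)
df <- data.frame(
  name    = c("adam", "adam", "brian", "brian"),
  purpose = c("contact", "contact", "case", "contact")
)

duplicated(df)        # FALSE TRUE FALSE FALSE - row 2 repeats row 1
df[duplicated(df), ]  # returns only the duplicated row (row 2)
```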

To return unique rows

See the notes above. To see the unique rows you add the logical negator ! in front of the duplicated() function:
df[!duplicated(df),]

To return rows that are duplicates of only certain columns

Subset the df inside the duplicated() parentheses, so the function operates on only certain columns of the df.

To specify the columns, provide column numbers or names after a comma (remember, all of this is within the duplicated() function).

Be sure to keep the second comma , outside, after the duplicated() function as well!

For example, to evaluate only columns 2 through 5 for duplicates: df[!duplicated(df[, 2:5]),]
To evaluate only columns name and purpose for duplicates: df[!duplicated(df[, c("name", "purpose")]),]

Slicing

Slicing a dataframe is useful in de-duplication if you have multiple rows per functional group (e.g. per "person") and you only want to analyze one or some of them. Think of slicing as a filter on the rows, by row number/position.

The basic slice() function accepts a number n. If positive, only the nth row is returned. If negative, all rows except the nth are returned.

Variations include:

  • slice_min() and slice_max() - to keep only the row(s) with the minimum or maximum value of the specified column. These also work with ordered factors.
  • slice_head() and slice_tail() - to keep only the first or last row
  • slice_sample() - to keep only a random sample of the rows

Use arguments n = or prop = to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. slice(df, n = 2)). See ?slice for more information.
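
A quick sketch of these behaviors on a hypothetical 5-row data frame (assuming dplyr is loaded):

```r
library(dplyr)

# hypothetical data frame with 5 rows
df <- data.frame(id = 1:5)

slice(df, 2)            # keeps only the 2nd row
slice(df, -1)           # drops the 1st row, keeping rows 2-5
slice_head(df, n = 2)   # keeps the first 2 rows
slice_sample(df, n = 2) # keeps 2 random rows
```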

Other arguments:

order_by = used in slice_min() and slice_max(), this is the column to order by before slicing.
with_ties = TRUE by default, meaning ties are kept.
.preserve = FALSE by default. If TRUE then the grouping structure is re-calculated after slicing.
weight_by = Optional, numeric column to weight by (bigger number more likely to get sampled). Also replace = for whether sampling is done with/without replacement.

TIP: When using slice_max() and slice_min(), be sure to write out the n = (e.g. n = 2, not just 2). Otherwise you may get an error like "Error: `...` is not empty".

NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.

Here, the basic slice() function is used to keep only the 4th row:

obs %>% 
  slice(4)  # keeps the 4th row only

Slice with groups

The slice_*() functions can be very useful if applied to a grouped dataframe, as the slice operation is performed on each group separately. Use the function group_by() in conjunction with slice() to group the data and then take a slice from each group.

This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use group_by() with key columns that are the same, and then use a slice function on a column that will differ among the grouped rows.

In the example below, to keep only the latest encounter per person, we group the rows by name and then use slice_max() with n = 1 on the date column. Be aware! To apply a function like slice_max() on dates, the date column must be class Date.

By default, “ties” (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. adam). To avoid this we set with_ties = FALSE. We get back only one row per person.

CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.

DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how for Mariah, she has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.

obs %>% 
  group_by(name) %>%       # group the rows by 'name'
  slice_max(date,          # keep row per group with maximum date value 
            n = 1,         # keep only the single highest row 
            with_ties = F) # if there's a tie (of date), take the first row

Breaking “ties”

Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.

# Example of multiple slice statements to "break ties"
obs %>%
  group_by(name) %>%
  
  # FIRST - slice by latest date
  slice_max(date, n = 1, with_ties = TRUE) %>% 
  
  # SECOND - if there is a tie, select row with latest time; ties prohibited
  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.

TIP: To use slice_max() or slice_min() on a “character” column, mutate it to an ordered factor class!
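
For instance, this hedged sketch (toy data with hypothetical status values) orders a character column as a factor so that slice_max() can rank it:

```r
library(dplyr)

# toy data: two encounters for "ana", one for "bob" (hypothetical values)
df <- data.frame(
  name   = c("ana", "ana", "bob"),
  status = c("probable", "confirmed", "probable")
)

kept <- df %>%
  # define the ranking of the character values via an ordered factor
  mutate(status = factor(status,
                         levels  = c("suspect", "probable", "confirmed"),
                         ordered = TRUE)) %>%
  group_by(name) %>%
  slice_max(status, n = 1, with_ties = FALSE)  # keeps the "highest" status per person

kept
```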

Keep all but mark them

If you want to keep all records but mark only some for analysis, consider a two-step approach utilizing a unique recordID/encounter number:

  1. Reduce/slice the original dataframe to only the rows for analysis. Save/retain this reduced dataframe.
  2. In the original dataframe, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced dataframe.

# 1. Define dataframe of rows to keep for analysis
obs_keep <- obs %>%
  group_by(name) %>%
  slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person


# 2. Mark original dataframe
obs_marked <- obs %>%

  # make new dup_record column
  mutate(dup_record = case_when(
    
    # if record is in obs_keep dataframe
    recordID %in% obs_keep$recordID ~ "For analysis", 
    
    # all else marked as "Ignore" for analysis purposes
    TRUE                            ~ "Ignore"))

# print
obs_marked

Calculate row completeness

Create a column that contains a metric for the row’s completeness (non-missingness). This could be helpful when deciding which rows to prioritize over others when de-duplicating/slicing.

In this example, “key” columns over which you want to measure completeness are saved in a vector of column names.

Then the new column key_completeness is created with mutate(). The new value in each row is defined as a calculated fraction: the number of non-missing values in that row among the key columns, divided by the number of key columns.

This involves the function rowSums() from base R. Also used is ., which within piping refers to the dataframe at that point in the pipe (in this case, it is being subset with brackets []).

# create a "key variable completeness" column
# this is a *proportion* of the columns designated as "key_vars" that have non-missing values

key_cols = c("personID", "name", "symptoms_ever")

obs %>% 
  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 

Roll-up values

This section describes:

  1. How to “roll-up” values from multiple rows into just one row, with some variations
  2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell

This tab uses the example dataset from the Preparation tab.

Roll-up values into one row

The code example below uses group_by() and summarise() to group rows by person, and then paste together all unique values within the grouped rows. Thus, you get one summary row per person. A few notes:

  • A suffix can be appended to all new columns ("_roll" in the last variation below)
  • If you want to show only unique values per cell, wrap na.omit() within unique()
  • na.omit() removes NA values; if this is not desired, remove it and use paste0(.x, collapse = "; ")

# "Roll-up" values into one row per group (per "personID") 
cases_rolled <- obs %>% 
  
  # create groups by name
  group_by(personID) %>% 
  
  # order the rows within each group (e.g. by date)
  arrange(date, .by_group = TRUE) %>% 
  
  # For each column, paste together all values within the grouped rows, separated by ";"
  summarise(
    across(everything(),                           # apply to all columns
           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

The result is one row per group (ID), with entries arranged by date and pasted together.

This variation shows unique values only:

# Variation - show unique values only 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                                   # apply to all columns
           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values

This variation appends a suffix to each column.
In this case "_roll" to signify that it has been rolled:

# Variation - suffix added to column names 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                
           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names

Overwrite values/hierarchy

If you then want to evaluate all of the rolled values, and keep only a specific value (e.g. “best” or “maximum” value), you can use mutate() across the desired columns, to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell content.

# CLEAN CASES
#############
cases_clean <- cases_rolled %>% 
    
    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
               
               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, then cell value converts to yes
               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, then cell value converts to no
               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, then cell value converts to Unknown
               TRUE                        ~ as.character(.x)))),   # otherwise, the value is kept as-is
      .keep = "unused")                                             # old columns removed, leaving only _mod columns

Now you can see in the column symptoms_ever that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.

Probabilistic de-duplication

Sometimes, you may want to identify “likely” duplicates based on similarity (e.g. string “distance”) across several columns such as name, age, sex, date of birth, etc. You can apply a probabilistic matching algorithm to identify likely duplicates.

See the page on Joining data for an explanation on this method. The section on Probabilistic Matching contains an example of applying these algorithms to compare a dataframe to itself, thus performing probabilistic de-duplication.

Resources

Much of the information in this page is adapted from these resources and vignettes online:

datanovia

dplyr tidyverse reference

cran janitor vignette

Iteration and loops

This page will introduce two approaches to iterative operations - using for loops and using the package purrr. Iterative operations help you perform repetitive tasks, reduce the chances of error, reduce code length, and maximize efficiency.

  1. for loops are a common tool in programming languages, but less so in R because it relies more on functions (still worth understanding though!)
  2. purrr operations can replace most for loops with more clear code

Preparation

Load packages

pacman::p_load(
     rio,
     here, 
     purrr,
     tidyverse
)

Load data

linelist <- rio::import("linelist_cleaned.xlsx")

for loops

As an epidemiologist, it is a common need to repeat analyses on sub-groups (e.g. jurisdictions or sub-populations). Iterating with a for loop is one method to automate this process.

A for loop has three core parts:

  1. The container for the results (optional)
  2. The sequence of items to iterate through
  3. The operations to conduct per item in the sequence

The basic syntax is: for (item in sequence) {do operations using item}. Note the parentheses and the curly brackets. The results could be printed to console, or stored in a container R object.
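
As a minimal illustration of the syntax (simple numbers, nothing epi-specific):

```r
# for each item in the sequence, run the operations in the curly brackets
for (num in c(2, 4, 6)) {
  print(num * 10)   # prints 20, then 40, then 60
}
```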

Container

Sometimes the results of your for loop will be printed to the console or Plots pane. Other times, you will want to store the outputs in a container for later use. Such a container could be a vector, a data frame, or even a list.

It is most efficient to create the container for the results before even beginning the for loop. In practice, this means creating an empty vector, data frame, or list. These can be created with the functions vector() for vectors or lists, or with matrix() and data.frame() for a data frame.

Empty vector

Say you want to store the median delay-to-admission for each hospital in a new vector. Use vector() and specify the class as either "double" (to hold numbers), "character", or "logical". In this case we would use "double" and set the length to be the number of expected outputs (length of the sequence, or in this case the number of unique hospitals in the data set).

delays <- vector(mode = "double",
                 length = length(unique(linelist$hospital))) # this is the number of unique hospitals in the dataset

Empty data frame

You can make an empty data frame by specifying the number of rows and columns like this:

delays <- data.frame(matrix(ncol = 2, nrow = 3))
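
A hedged extension of the snippet above: the columns are created without meaningful names (X1, X2), so you may want to assign names yourself (the names below are hypothetical):

```r
# empty 3x2 data frame, then assign hypothetical column names
delays <- data.frame(matrix(ncol = 2, nrow = 3))
colnames(delays) <- c("hospital", "median_delay")
delays  # 3 rows of NA, ready to be filled in by a loop
```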

Empty list

Say you want to store some plots created by a for loop in a list. You actually initialize the container using the same vector() command as above, but with mode = "list". Specify the length however you wish.

plots <- vector(mode = "list", length = 16)

Sequence

This is the “for” part of a for loop - the operations will run for each item in the sequence. The sequence can be a series of character values (e.g. of jurisdictions, diseases, etc), or R object names (e.g. column names or list element names), or the sequence can be a series of consecutive numbers (e.g. 1,2,3,4,5). Each approach has their own utilities, described below.

Sequence of character values

In this case, the loop is applied for each value in a character vector.

# make vector of the hospital names
hospital_names <- unique(linelist$hospital)
hospital_names # print
## [1] "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)" "Other"                               
## [4] "Missing"                              "Military Hospital"                    "Central Hospital"

The value of the "item" changes with each iteration of the loop, proceeding through each value in the character vector. In this example, the term hosp represents a value from the vector hospital_names. For the first iteration of the loop the value would be "Port Hospital". For the second iteration it would be "St. Mark's Maternity Hospital (SMMH)". And so on…

# 'for loop'
for (hosp in hospital_names){       # sequence
  
  # OPERATIONS HERE
  
}

Sequence of names

This is a variation on the character sequence above, in which the names of an existing R object are extracted and become the character vector. For example, the column names of a data frame. This is useful because you know the names are exact matches to the column names and thus can be used to index the R object within the for loop.

Below, the sequence is the names() (column names) of linelist. Inside the for loop, the column names are used to index (subset) linelist one-at-a-time. In this example, we demonstrate an if conditional statement as part of the operations code within the for loop. If the column of interest is class Numeric, then the mean of the column is printed to the console. If the column is not class Numeric then another statement is printed to the console.

A note on indexing with column names - whenever referencing the column itself (e.g. within mean()) do not just write "col"! col is just the character column name! To refer to the entire column, use the column name as an index on linelist via linelist[[col]].

for (col in names(linelist)){ 
  
  # if column is class Numeric, print the mean value
  if(is.numeric(linelist[[col]])) {
    print(mean(linelist[[col]], na.rm=T))     # don't forget to index with [[col]]
    } else {        
    print("Column not numeric")            # if column is not numeric, print this
  }
  
}
## [1] "Column not numeric"
## [1] 16.56165081521739
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] 16.20082744354422
## [1] "Column not numeric"
## [1] 16.14425673734414
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] -13.23380634001095
## [1] 8.469637508784336
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] 53.14758831521739
## [1] 124.7961956521739
## [1] 21.19412364130435
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] "Column not numeric"
## [1] 38.54143924908743
## [1] "Column not numeric"
## [1] 48.11272478311054
## [1] 2.011888586956522

Sequence of numbers

Use this approach if you plan to do more complicated operations or to store the results of the for loop. In this approach, the sequence is a series of consecutive numbers. Thus, the value of the “item” is not a character value (e.g. “Central Hospital” or “date_onset”) but is a number. This is useful for looping through data frames, as you can use the numeric item inside the for loop to index the dataframe by row number.

For example, let’s say that you want to loop over every row in your data frame and extract certain information. Your “items” would be numeric row numbers. The process could be explained as “for every item in a sequence of numbers from 1 to the total number of rows in my data frame, do X”. The first iteration of the loop, i would be 1. For the second iteration, i would be 2, etc.

Whew, that was a mouthful of words! Here is what it looks like in code: for (i in seq_len(nrow(linelist))) {} where i represents the item and seq_len() produces a sequence of consecutive numbers from 1 to the number of rows in linelist. If using this approach on a named vector (not a data frame), use seq_along(), like for (i in seq_along(hospital_names)) {}.

for (i in seq_len(nrow(linelist))) {  # use on a data frame
  # OPERATIONS HERE
}  

The below code actually returns numbers, which become the value of i in their respective loop.

seq_along(hospital_names)  # use on a named vector
## [1] 1 2 3 4 5 6

Operations

This is the code within the for loop. You want this to run for each item in the sequence. Therefore, be careful that every part of your code that should change with the item is correctly coded such that it actually changes! Remember to use [[ ]] for indexing.

Below, for example, we use seq_len() on the linelist. The gender and age of each row are pasted together and stored in the container character vector cases_demographics.

# create container to store results - a character vector
cases_demographics <- vector(mode = "character", length = nrow(linelist))

# the for loop
for (i in seq_len(nrow(linelist))){
  
  # OPERATIONS
  # extract values from linelist for i using indexing
  row_gender  <- linelist$gender[[i]]
  row_age     <- linelist$age_years[[i]]    # don't forget to index!
  
  # store the gender-age in container at indexed location
  cases_demographics[[i]] <- str_c(row_gender, row_age, sep = ", ") 

}  # end for loop

# display first 10 rows of container
head(cases_demographics, 10)
##  [1] "m, 1"  "f, 4"  "m, 21" "f, 2"  "m, 27" "m, 25" "f, 18" "f, 2"  "m, 20" "f, 4"

Printing

Note that to print from within a for loop you will likely need to explicitly wrap with the function print().

In this example below, the sequence is an explicit character vector, which is used to subset the linelist by hospital. The results are not stored in a container, but rather printed to console with the print() function.

for (hosp in hospital_names){ 
  hospital_cases <- linelist %>% filter(hospital == hosp)
  print(nrow(hospital_cases))
}
## [1] 1762
## [1] 422
## [1] 885
## [1] 1469
## [1] 896
## [1] 454

Testing your for loop

To test your loop, you can make a temporary assignment of the item, such as i <- 10 or hosp <- "Central Hospital", and run your operations code to see if the expected results are produced.

Looping plots

To put all three components together (container, sequence, and operations) let’s try to plot an epicurve for each hospital (see the page on Epidemic curves).

Of course, we can make an epicurve of all the cases using the incidence2 package as below:

# create 'incidence' object
outbreak <- incidence2::incidence(   
     x = linelist,                   # dataframe - complete linelist
     date_index = date_onset,        # date column
     interval = "week",              # aggregate counts weekly
     groups = gender,                # group values by gender
     na_as_group = TRUE)             # missing gender is own group

# plot epi curve
plot(outbreak,                       # name of incidence object
     fill = "gender",                # color bars by gender
     color = "black",                # outline color of bars
     title = "Outbreak of ALL cases" # title
     )

To produce a separate plot for each hospital’s cases, we can put this epicurve code within a for loop.

First, we save a character vector of the unique hospital names, hospital_names. The for loop will run once for each of these names (for (hosp in hospital_names)). On each iteration of the for loop, the current hospital name from the vector is represented as hosp for use within the loop.

Within the loop, you can write R code as normal, but use the item (hosp in this case) knowing that its value will be changing. Within this loop:

  • A filter() is applied to linelist, such that column hospital must equal the current value of hosp
  • The incidence object is created on the filtered linelist
  • The plot for the current hospital is created, with an auto-adjusting title
  • The plot for the current hospital is temporarily saved and then printed
  • The loop then moves onward to repeat with the next hospital in hospital_names

# make vector of the hospital names
hospital_names <- unique(linelist$hospital)

# for each name ("hosp") in hospital_names, create and print the epi curve
for (hosp in hospital_names) {
     
     # create incidence object specific to the current hospital
     outbreak_hosp <- incidence2::incidence(
                    x = linelist %>% filter(hospital == hosp),   # linelist is filtered to the current hospital
                    date_index = date_onset,
                    interval = "week", 
                    groups = gender,
                    na_as_group = TRUE
     )
     
     # Create and save the plot. Title automatically adjusts to the current hospital
     plot_hosp <- plot(outbreak_hosp,
                       fill = "gender",
                       color = "black",
                       title = stringr::str_glue("Epidemic of cases admitted to {hosp}")
                       )
     
     # print the plot for the current hospital
     print(plot_hosp)

} # end the for loop when it has been run for every hospital in hospital_names 

Tracking progress of a loop

A loop with many iterations can run for many minutes or even hours. Thus, it can be helpful to print the progress to the R console. This code can be placed within the loop to print every 100th number.

# loop with code to print progress every 100 iterations
for (row in 1:nrow(linelist)){

  # print progress
  if(row %% 100 == 0){   # the %% operator returns the remainder (modulo)
    print(row)
  }

}

purrr

One approach to iterative operations is the purrr package. If you are using a for loop, you can probably do it with purrr! For example, applying a model to different datasets, producing plots or maps for various jurisdictions, or iterating data management tasks (across columns or subsets).

In this section we will explain the basic syntax and a few key functions, and demonstrate the use by making plots and by performing data operations such as importing/exporting multiple Excel sheets and CSV files.

See the Resource section for more extensive trainings. Also, here is the purrr online cheatsheet.

Load packages

purrr is part of the tidyverse, so there is no need to install/load a separate package.

pacman::p_load(
  rio,            # import/export
  here,           # relative filepaths
  tidyverse,      # data mgmt and viz
  writexl,        # write Excel file with multiple sheets
  readxl          # import Excel with multiple sheets
  )

One core purrr function is map(), which “maps” (applies) a function to each input element. There are several variations on map() for specific use cases, as detailed below.

The key arguments are:

  • .x = this is the input - e.g. a vector, data frame, or list upon which the .f function will be iteratively applied
  • .f = this is the function to apply to each element of the .x input

The basic form is map(.x, ~.f), which in a pipe chain can also look like:

df %>% 
  map(~.f)

You may encounter the syntax .x (or simply .) within the .f function as a placeholder for the .x input of that iteration.
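
As a minimal sketch of this syntax (a toy list of numeric vectors, with sum() as the .f function):

```r
library(purrr)

# apply sum() to each element of a toy list; the ~ defines the function,
# and .x stands for the current element of the input
results <- map(.x = list(1:3, 4:6), .f = ~ sum(.x))
results   # a list of two elements: 6 and 15
```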

Mapping a function across columns

Below, we map() the function t.test() across numeric columns, comparing values by gender. Recall from the page on Descriptive analysis that t.test() can take inputs in a formula format, such as NUMERIC_COLUMN ~ BINARY COLUMN. In this example, we do the following:

  • The numeric columns of interest are selected from linelist - these are the .x inputs
  • The function t.test() is supplied as the .f function mapped to each numeric column (note tilde ~ in front)
  • Within the parentheses of t.test():
    • the . represents the current column being mapped
    • the second ~ is part of the t-test equation
    • the linelist$gender is the binary column for the t-test comparison; note it is a separate column, not included in select(), so that it does not appear on the left side of the t.test equation.

The result is a list of t-test results - one element for each numeric column. Only the first one of six is shown for demonstration purposes.

# Results are saved as a list
t.test_results <- linelist %>% 
  select(age, wt_kg, ht_cm, ct_blood, temp) %>%  # keep only the numeric columns to map across
  map(.f = ~t.test(.x ~ linelist$gender))              # t.test function, with equation NUMERIC ~ CATEGORICAL

t.test_results[[1]] # show first result 
## 
##  Welch Two Sample t-test
## 
## data:  .x by linelist$gender
## t = -22.813808812458, df = 4803.371114955, p-value < 2.2204460493e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.032928518914238 -6.761591694025840
## sample estimates:
##   mean in group f   mean in group m 
## 12.49840142095915 19.89566152742919

If you wanted the p-values only, you can modify the .f function by appending $p.value to the t.test() output. This way, the value that map() returns to its output list is only the p-value and not the entire t.test output.

linelist %>% 
  select(age, wt_kg, ht_cm, ct_blood, temp) %>% 
  map(.f = ~t.test(. ~ linelist$gender)$p.value)
## $age
## [1] 1.821713221074871e-109
## 
## $wt_kg
## [1] 5.768922697194617e-198
## 
## $ht_cm
## [1] 6.116975742624153e-142
## 
## $ct_blood
## [1] 0.7107848639794567
## 
## $temp
## [1] 0.5572371740897707

Note:
Remember that if you want to apply a function to only certain columns in a data frame, you can also use mutate() and across(), as explained in the Grouping data page. Below is an example of applying as.character() to only the “age” columns. Note the placement of the parentheses and commas.

# convert columns with column name containing "age" to class Character
linelist <- linelist %>% 
  mutate(across(contains("age"), as.character))  

Custom functions

You will often want to create your own function to provide to map(). In fact, we already did this! While t.test() was an existing function, when we used it in map() the second time we modified it by adding $p.value to the end, which in effect changed the function provided to map().

One example of making a purely custom plotting function to provide to map() is shown below.

Let’s say we want to create simple epicurves for each hospital. To do this using purrr, our .f function can be ggplot() and extensions with + as usual. As the output of map() is always a list, the plots are stored in a list. They can be extracted and plotted with the ggarrange() function from the ggpubr package (see its documentation).

# load package for plotting elements from list
pacman::p_load(ggpubr)

# map across the vector of 6 hospital "names" (created earlier)
# use the ggplot function specified
# output is a list with 6 ggplots

my_plots <- map(
  .x = hospital_names,
  .f = ~ggplot(data = linelist %>% filter(hospital == .x))+
                geom_histogram(aes(x = date_onset)) +
                labs(title = .x)
)

# print the ggplots (they are stored in a list)
ggarrange(plotlist = my_plots, ncol = 2, nrow = 3)

If this code style looks too messy, you can achieve the same result by saving your specific ggplot() command as a custom user-defined function, for example one named make_epicurve(). This function is then used within map(). .x will be iteratively replaced by the hospital name, and used as hosp_name in the make_epicurve() function. See the page on Writing functions.

make_epicurve <- function(hosp_name){
  
  ggplot(data = linelist %>% filter(hospital == hosp_name)) +
    geom_histogram(aes(x = date_onset)) +
    theme_classic()+
    labs(title = hosp_name)
  
}
# mapping
my_plots <- map(hospital_names, ~make_epicurve(hosp_name = .x))

# print the ggplots (they are stored in a list)
ggarrange(plotlist = my_plots, ncol = 2, nrow = 3)

Split and combine datasets

Split dataset and export CSV files

Here is a more complex purrr map() example. Let’s say that we want to create a separate linelist for each hospital and export each as a separate CSV file. This is a task that would be arduous if done copy-paste by hand in Excel, and involve a lot of code if each filter() and export() was a distinct command (imagine if we wanted to make a linelist for each hospital-gender!).

Below, we do the following steps:

Use group_split() (from dplyr) to split the linelist by hospital of admission - the output is a list with one “element” per hospital subset (in this case, each element is a dataframe)

linelist_split <- linelist %>% 
  group_split(hospital)

You can run View(linelist_split) and see that this list contains 6 data frames, each representing the cases from one hospital.

However, note that the data frames in the list do not have names! This is standard behavior of group_split(), but we want each element to have a name, and to use that name when saving the CSV file. So, we use pull() (from dplyr) to extract the hospital column from each data frame in the list. Then, to be safe, we convert the values to character and use unique() to get the name for the dataset.

names(linelist_split) <- linelist_split %>%
  purrr::map(.f = ~pull(.x, hospital)) %>% # Pull out the hospital column
  purrr::map(.f = ~as.character(.x)) %>% # Convert factor to character
  purrr::map(.f = ~unique(.x))

We can now see that each of the list elements has a name. These names can be accessed via names(linelist_split).

names(linelist_split)
## [1] "Central Hospital"                     "Military Hospital"                    "Missing"                             
## [4] "Other"                                "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"

Lastly, we take the vector of names (shown above) and use map() to iterate through them, applying the export() function to each element of the list linelist_split and saving with the correct name. Here is how it works:

  • We begin with the vector of character names, passed to map() as .x (the sequence)
  • The .f function is export() (rio package, see Import and export page), which needs a dataframe and a filepath to write to
  • The input .x (the hospital name) is used within .f to extract/index that specific element of linelist_split list. This results in only one data frame at a time being provided to export().
    • For example, when .x is “Military Hospital”, linelist_split[[.x]] is linelist_split[["Military Hospital"]], returning the second element of linelist_split - all the cases from that hospital.
  • The filepath provided to export() is dynamic via use of str_glue() (see Characters and strings page):
    • here() is used to get the base of the filepath and specify the “data” folder (note single quotes to not interrupt the str_glue() double quotes)
    • Then a slash /, and then again the .x which prints the current hospital name to make the file identifiable
    • Finally the extension “.csv” which export() uses to create a CSV file
names(linelist_split) %>%
  map(.f = ~export(linelist_split[[.x]], file= str_glue("{here('data')}/{.x}.csv")))

Now you can see that each file is saved in the “data” folder of the R Project “Epi_R_handbook”!

Split dataset and export as Excel sheets

To export the hospital linelists as an Excel workbook with one linelist per sheet, we can just provide the named list linelist_split to the write_xlsx() function from the writexl package. This has the ability to save one Excel workbook with multiple sheets. The list element names are automatically applied as the sheet names.

linelist_split %>% 
  writexl::write_xlsx(path = here("data", "hospital_linelists.xlsx"))

You can now open the Excel file and see that each hospital has its own sheet.

More than one group_split() column

If you wanted to split the linelist by more than one grouping column, such as to produce subset linelist by intersection of hospital AND gender, you will need a different approach to naming the list elements. This involves collecting the unique “group keys” using group_keys() from dplyr - they are returned as a data frame. Then you can combine the group keys into values with unite() as shown below, and assign these conglomerate names to linelist_split.

# split linelist by unique hospital-gender combinations
linelist_split <- linelist %>% 
  group_split(hospital, gender)

# extract group_keys() as a dataframe
groupings <- linelist %>% 
  group_by(hospital, gender) %>%       
  group_keys()

groupings      # show unique groupings 
## # A tibble: 18 x 2
##    hospital                             gender
##  * <chr>                                <chr> 
##  1 Central Hospital                     f     
##  2 Central Hospital                     m     
##  3 Central Hospital                     <NA>  
##  4 Military Hospital                    f     
##  5 Military Hospital                    m     
##  6 Military Hospital                    <NA>  
##  7 Missing                              f     
##  8 Missing                              m     
##  9 Missing                              <NA>  
## 10 Other                                f     
## 11 Other                                m     
## 12 Other                                <NA>  
## 13 Port Hospital                        f     
## 14 Port Hospital                        m     
## 15 Port Hospital                        <NA>  
## 16 St. Mark's Maternity Hospital (SMMH) f     
## 17 St. Mark's Maternity Hospital (SMMH) m     
## 18 St. Mark's Maternity Hospital (SMMH) <NA>

Now we combine the groupings together, separated by dashes, and assign them as the names of the list elements in linelist_split. This takes some extra lines, as we replace NA with “Missing”, use unite() from tidyr to combine the column values (separated by dashes), and then convert the result into an un-named vector so it can be used as the names of linelist_split.

# Combine into one name value 
names(linelist_split) <- groupings %>% 
  mutate(across(everything(), ~ replace_na(.x, "Missing"))) %>%  # replace NA with "Missing" in all columns
  unite("combined", sep = "-") %>%                         # Unite all column values into one
  setNames(NULL) %>% 
  as_vector() %>% 
  as.list()

Reading in multiple Excel sheets

For reference, if you want to use purrr to import multiple Excel workbook sheets and combine them (the reverse of above), you can use the package readxl as demonstrated below.

First, extract the sheet names from the Excel workbook. Use excel_sheets() from the readxl package. You provide the filepath within the parentheses.

sheet_names <- readxl::excel_sheets(here("data", "hospital_linelists.xlsx"))

sheet_names
## [1] "Central Hospital"              "Military Hospital"             "Missing"                       "Other"                        
## [5] "Port Hospital"                 "St. Mark's Maternity Hospital"

Now we can use this vector of sheet names to iteratively import() the sheets. The which = argument of import() specifies which sheet to read, and is given .x - the sheet name currently being mapped over. Finally, because we have used map(), the sheets are saved in a list - each data frame is one element of the list.

sheets_as_list <- sheet_names %>% 
  map(.f = ~rio::import(here("data", "hospital_linelists.xlsx"), which = .x))

Assuming each data frame has the same columns, we can combine the six data frames with a simple bind_rows() command (from dplyr). Optionally, add .id = "sheet_name" to gain a column specifying which list element each row came from (this column will contain the actual sheet names only if the list elements are named).

combined_sheets <- bind_rows(sheets_as_list)
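Note that because the list above was created without names, .id = would record only list positions (“1”, “2”, …). A sketch of naming the elements first with set_names() from purrr, so that the new column contains the actual sheet names:

sheets_as_list <- sheet_names %>% 
  purrr::set_names() %>%     # use the sheet names as list element names
  map(.f = ~rio::import(here("data", "hospital_linelists.xlsx"), which = .x))

combined_sheets <- bind_rows(sheets_as_list, .id = "sheet_name")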

Resources

for loops with Data Carpentry

The R for Data Science page on iteration

Vignette on write/read Excel files

A purrr tutorial

purrr cheatsheet


IV Analysis

Missing data

This page will cover:

  1. Useful functions for assessing missingness
  2. Assess missingness in a dataframe
  3. How to filter out rows by missingness
  4. Plotting missingness over time
  5. Handling how NA is displayed in plots
  6. Missing value imputation

Preparation

Load packages

pacman::p_load(
  rio,           # import/export
  tidyverse,     # data mgmt and viz
  naniar         # assess and visualize missingness
)

Load data

linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are viewable in the online version of this page.

NA

In R, missing values are represented by a reserved (special) value - NA. Note that this is typed without quotes. “NA” is different and is just a normal character value (also a Beatles lyric from the song Hey Jude).

Your data may have other ways of representing missingness, such as “99”, “Missing”, or “Unknown” - you may even have an empty character value "" which looks “blank”, or a single space " ". Be aware of these and consider whether to convert them to NA during import or data cleaning (e.g. with na_if()). You may also want to convert the other way - changing all NA to “Missing” or similar (e.g. with replace_na() or fct_explicit_na()).

Versions of NA

Most of the time, all you need to know/use is NA and is.na(). However sometimes you may encounter the need for variations on NA listed below. One example is when creating a new column with case_when() and deciding to assign NA as the outcome for some logical criteria (see Cleaning data and core functions page for tips on case_when()).

There may be circumstances where NA on the right-hand side (RHS) of case_when() is rejected because the other RHS values are a class such as Character. The RHS values in a case_when() command must all be of the same class. Thus, if you have character outcomes on the RHS like “Confirmed”, “Suspect”, “Probable” and then NA - you will get an error. Instead of NA, put “Missing”, or else you must put NA_character_. Likewise for integers, use NA_integer_ instead of NA. NA should work for dates and logical. See the R documentation on NA for more information.

  • NA
  • NA_character_
  • NA_integer_
  • NA_real_
  • NA_complex_
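As a small sketch of the case_when() scenario described above (the ct column and cutoff here are hypothetical): because the other right-hand side outcomes are character, the typed NA_character_ must be used.

df <- dplyr::tibble(ct = c(15, 30, NA))

df %>% 
  dplyr::mutate(result = dplyr::case_when(
    ct < 25  ~ "Positive",
    ct >= 25 ~ "Negative",
    TRUE     ~ NA_character_    # plain NA here can trigger a class error
  ))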

na.rm = TRUE

When you apply mathematical functions such as max(), min(), sum() or mean(), any NA value present causes NA to be returned. You must specify the argument na.rm = TRUE within the function to remove NA values from the calculation. This default behavior is intentional, so that you are alerted if any of your data are missing.
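For example:

sum(c(2, NA, 5))                 # returns NA
sum(c(2, NA, 5), na.rm = TRUE)   # returns 7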

NULL

NULL is another reserved value in R. It represents an empty or absent value, and is returned by expressions or functions whose values are undefined. Generally do not assign NULL as a value, unless writing functions or perhaps writing a Shiny app that must return NULL in specific scenarios. Null-ness can be assessed using is.null() and conversion can be made with as.null().

See this blog post on the difference between NULL and NA.

NaN

Impossible values are represented by the special value NaN (“not a number”). An example of this is when you force R to divide 0 by 0. You can assess this with is.nan(). You may also encounter the complementary functions is.infinite() and is.finite().

Inf

Inf represents an infinite value, such as when you divide a number by 0. Say you have a vector z that contains these values: z <- c(1, 22, NA, Inf, NaN, 5)

If you want to use max() on the column, you can use na.rm = TRUE as described above to remove the NA from the calculation, but the Inf and NaN remain and Inf will be returned. You can use brackets [ ] to subset the vector such that only finite values are used for the calculation: max(z[is.finite(z)]).

z <- c(1, 22, NA, Inf, NaN, 5)
max(z)                           # returns NA
max(z, na.rm=T)                  # returns Inf
max(z[is.finite(z)])             # returns 22

Examples

R command      Outcome
5 / 0          Inf
0 / 0          NaN
5 / NA         NA
5 / Inf        0
NA - 5         NA
Inf / 5        Inf
class(NA)      "logical"
class(NaN)     "numeric"
class(Inf)     "numeric"
class(NULL)    "NULL"

“NAs introduced by coercion” is a common warning message. This can happen if you attempt to make an illegal conversion.

as.numeric (c("10", "20", "thirty", "40"))
## Warning: NAs introduced by coercion
## [1] 10 20 NA 40

NULL is ignored in a vector.

my_vector <- c(25, NA, 10, NULL)  # define
my_vector                         # print
## [1] 25 NA 10

Variance of one number results in NA.

var(22)
## [1] NA

Useful functions

The following are useful base R functions when assessing or handling missing values:

is.na() and !is.na()

Use is.na() to identify missing values, or use its opposite (with ! in front) to identify non-missing values. Both return a logical value (TRUE or FALSE). Remember that you can sum() the resulting vector to count the number of TRUE values, e.g. sum(is.na(linelist$date_outcome)).

my_vector <- c(1, 4, 56, NA, 5, NA, 22)
is.na(my_vector)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
!is.na(my_vector)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
sum(is.na(my_vector))
## [1] 2

na.omit()

This base R function, if applied to a data frame, will remove rows with any missing values.
If applied to a vector, it removes NA values from that vector. For example:

sum(na.omit(my_vector))
## [1] 88
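A small sketch of the data-frame behavior, using a tiny example data frame with hypothetical columns x and y:

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
na.omit(df)     # returns only the first row, which has no missing values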

na.rm = TRUE

In R, most mathematical functions will by default include NA in calculations, which results in the function returning NA. This is designed intentionally, in order to make you aware that you have missing data.

You can avoid this by removing missing values from the calculation, by including the argument na.rm = TRUE (na.rm stands for “remove NA”).

mean(my_vector)
## [1] NA
mean(my_vector, na.rm = TRUE)
## [1] 17.6

Assess missingness in a dataframe

You can use the package naniar to assess and visualize missingness in the dataframe linelist.

# install and/or load package
pacman::p_load(naniar)

Statistics

To find the percent of all values that are missing use pct_miss(). Use n_miss() to get the number of missing values.

pct_miss(linelist)
## [1] 6.376245471014492
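The count equivalent n_miss() is used the same way, on a whole data frame or on a single column:

n_miss(linelist)        # number of missing values in the whole dataframe
n_miss(linelist$temp)   # number of missing values in one column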

The two functions below return the percent of rows with any missing value, and the percent of rows that are entirely complete, respectively. Remember that NA means missing, and that values like "" or " " will register as non-missing.

pct_miss_case(linelist)   # also see n_complete() for counts
## [1] 67.56114130434783
pct_complete_case(linelist) # see n_complete
## [1] 32.43885869565217

Visualizing missingness

The gg_miss_var() function will show you the number of missing values in each column. You can provide a bare column name to the facet = argument to see the plot by groups. By default, counts are shown instead of percents (show_pct = FALSE). You can also add labels as with a normal ggplot using + labs().

gg_miss_var(linelist, show_pct = TRUE)

You can use vis_miss() to visualize the dataframe as a heatmap, showing whether each value is missing or not. As usual, you can also select() certain columns from the dataframe and provide only those to the function.

vis_miss(linelist)

Explore and visualize missingness relationships

How do you visualize something that is not there? By default, ggplot2 removes points with missing values from plots.

naniar offers a solution via geom_miss_point(). When creating a scatterplot of two columns, records with one of the values missing and the other present are shown by setting the missing values to 10% lower than the lowest value in the column, and coloring them distinctly.

In the scatterplot below, the red dots are records where the value for one column is present but the value for the other column is missing.

ggplot(
  linelist, 
  aes(x = age_years,             
      y = temp)) +     # column to show missingness
  geom_miss_point()

To assess missingness in the dataframe stratified by another column, consider gg_miss_fct(), which returns a heatmap of percent missingness in the dataframe by a factor/categorical (or date) column:

gg_miss_fct(linelist, age_cat5)

This function can also be used with a date column to see missingness over time:

gg_miss_fct(linelist, date_onset)

“Shadow” columns

Another way to visualize missingness in one column by values in a second column is using the “shadow” that naniar can create. bind_shadow() creates a binary NA/not NA column for every existing column, and binds all these new columns to the original dataset with the appendix "_NA". This doubles the number of columns - see below:

shadowed_linelist <- linelist %>% 
  bind_shadow()

names(shadowed_linelist)
##  [1] "case_id"                 "generation"              "date_infection"          "date_onset"              "date_hospitalisation"   
##  [6] "date_outcome"            "outcome"                 "gender"                  "age"                     "age_unit"               
## [11] "age_years"               "age_cat"                 "age_cat5"                "hospital"                "lon"                    
## [16] "lat"                     "infector"                "source"                  "wt_kg"                   "ht_cm"                  
## [21] "ct_blood"                "fever"                   "chills"                  "cough"                   "aches"                  
## [26] "vomit"                   "temp"                    "time_admission"          "bmi"                     "days_onset_hosp"        
## [31] "case_id_NA"              "generation_NA"           "date_infection_NA"       "date_onset_NA"           "date_hospitalisation_NA"
## [36] "date_outcome_NA"         "outcome_NA"              "gender_NA"               "age_NA"                  "age_unit_NA"            
## [41] "age_years_NA"            "age_cat_NA"              "age_cat5_NA"             "hospital_NA"             "lon_NA"                 
## [46] "lat_NA"                  "infector_NA"             "source_NA"               "wt_kg_NA"                "ht_cm_NA"               
## [51] "ct_blood_NA"             "fever_NA"                "chills_NA"               "cough_NA"                "aches_NA"               
## [56] "vomit_NA"                "temp_NA"                 "time_admission_NA"       "bmi_NA"                  "days_onset_hosp_NA"

These “shadow” columns can be used to plot the proportion of values that are missing, by another column. For example, the plot below shows the proportion of records missing days_onset_hosp (number of days from symptom onset to hospitalisation), by that record’s value in date_hospitalisation. Essentially, you are plotting the density of the x-axis column, but stratifying the results (color =) by a shadow column of interest. This analysis works best if the x-axis is a numeric or date column.

ggplot(
  shadowed_linelist,                   # dataframe with shadow columns
  aes(x = date_hospitalisation,        # numeric or date column
      colour = days_onset_hosp_NA)) +    # shadow column of interest
  geom_density()                       # plots the density curves

You can also use these “shadow” columns to stratify a statistical summary, as shown below:

linelist %>%
  bind_shadow() %>%                # create the shadow columns
  group_by(date_outcome_NA) %>%    # shadow col for stratifying
  summarise_at(.vars = c("age_years"),                        # variable of interest for calculations
               .funs = c("mean", "sd", "var", "min", "max"),  # stats to calculate
               na.rm = TRUE)       # other arguments for the stat calculations
## # A tibble: 2 x 6
##   date_outcome_NA  mean    sd   var   min   max
## * <fct>           <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 !NA              16.2  12.7  161.     0    90
## 2 NA               15.9  12.9  166.     0    83

An alternative way to plot the proportion of values in one column that are missing, including over time, is given below. It does not involve naniar. This example shows the percent of weekly observations that are missing:

  1. Aggregate the data into a useful time unit (days, weeks, etc.), summarizing the proportion of observations with NA (and any other values of interest)
  2. Plot the proportion missing as a line using ggplot()

Below, we take the linelist, add a new column for week, group the data by week, and then calculate the percent of that week’s records where the value is missing. (note: if you want % of 7 days the calculation would be slightly different).

outcome_missing <- linelist %>%
  mutate(week = lubridate::floor_date(date_onset, "week")) %>%   # create new week column
  group_by(week) %>%                                             # group the rows by week
  summarize(                                                     # summarize each week
    n_obs = n(),                                                     # number of records
    
    outcome_missing = sum(is.na(outcome) | outcome == ""),       # number of records missing the value
    outcome_p_miss  = outcome_missing / n_obs,                   # proportion of records missing the value
  
    outcome_dead    = sum(outcome == "Death", na.rm=T),          # number of records as dead
    outcome_p_dead  = outcome_dead / n_obs) %>%                  # proportion of records as dead
  
  tidyr::pivot_longer(-week, names_to = "statistic") %>%         # pivot all columns except week, to long format for ggplot
  filter(stringr::str_detect(statistic, "_p_"))                  # keep only the proportion values

Then we plot the proportion missing as a line, by week

ggplot(data = outcome_missing)+
    geom_line(
      aes(x = week, y = value, group = statistic, color = statistic),
      size = 2,
      stat = "identity")+
    labs(title = "Weekly outcomes",
         x = "Week",
         y = "Proportion of weekly records") + 
     scale_color_discrete(
       name = "",
       labels = c("Died", "Missing outcome"))+
    scale_y_continuous(breaks = c(seq(0,1,0.1)))+
  theme_minimal()+
  theme(
    legend.position = "bottom"
  )

Filter out rows with missing values

To quickly remove rows with missing values, use the dplyr function drop_na().

Check the number of rows in the original linelist with nrow(linelist); the adjusted numbers of rows are shown below:

linelist %>% 
  drop_na() %>%     # remove rows with ANY missing values
  nrow()
## [1] 1910

You can specify to drop rows with missingness in certain columns:

linelist %>% 
  drop_na(date_onset) %>% # remove rows missing date_onset 
  nrow()
## [1] 5888

Multiple columns can be specified one after the other, or by using “tidyselect” helper functions as below:

linelist %>% 
  drop_na(contains("date")) %>% # remove rows missing values in any "date" column 
  nrow()
## [1] 3178

Handling NA in ggplot()

It is often wise to report the number of values excluded from a plot in a caption. Below is an example:

In ggplot(), you can add labs() and within it a caption =. In the caption, you can use str_glue() from stringr package to paste values together into a sentence dynamically so they will adjust to the data. An example is below:

  • Note the use of \n for a new line.
  • Note that if multiple columns would contribute to values not being plotted (e.g. age or sex, if those are reflected in the plot), then you must filter on those columns as well to correctly calculate the number not shown.
labs(
  title = "",
  y = "",
  x = "",
  caption  = stringr::str_glue(
  "n = {nrow(central_data)} from Central Hospital;
  {nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown."))  

Sometimes, it can be easier to save the string as an object in commands prior to the ggplot() command, and simply reference the named string object within the str_glue().

NA in factors

If your column of interest is a factor, use fct_explicit_na() from the forcats package to convert NA values to a character value. By default the new value is “(Missing)” but this can be adjusted via the na_level = argument.

pacman::p_load(forcats)   # load package

linelist <- linelist %>% 
  mutate(gender = fct_explicit_na(gender, na_level = "Missing"))

levels(linelist$gender)
## [1] "f"       "m"       "Missing"

Imputation

Sometimes, when analyzing your data, it will be important to “fill in the gaps” and impute missing data. While you can always simply analyze a dataset after removing all missing values, this can cause problems in many ways. Here are two examples:

  1. By removing all observations with missing values or variables with a large amount of missing data, you might reduce your power or ability to do some types of analysis. For example, as we discovered earlier, only a small fraction of the observations in our linelist dataset have no missing data across all of our variables. If we removed the majority of our dataset we’d be losing a lot of information! And, most of our variables have some amount of missing data–for most analysis it’s probably not reasonable to drop every variable that has a lot of missing data either.

  2. Depending on why your data is missing, analysis of only non-missing data might lead to biased or misleading results. For example, as we learned earlier we are missing data for some patients about whether they’ve had some important symptoms like fever or cough. But, as one possibility, maybe that information wasn’t recorded for people that just obviously weren’t very sick. In that case, if we just removed these observations we’d be excluding some of the healthiest people in our dataset and that might really bias any results.

It’s important to think about why your data might be missing in addition to seeing how much is missing. Doing this can help you decide how important it might be to impute missing data, and also which method of imputing missing data might be best in your situation.

Types of missing data

Here are three general types of missing data:

  1. Missing Completely at Random (MCAR). This means that there is no relationship between the probability of data being missing and any of the other variables in your data. The probability of being missing is the same for all cases. This is a rare situation, but if you have strong reason to believe your data are MCAR, analyzing only non-missing data without imputing won’t bias your results (although you may lose some power).

  2. Missing at Random (MAR). This name is actually a bit misleading, as MAR means that your data is missing in a systematic, predictable way based on the other information you have. For example, maybe every observation in our dataset with a missing value for fever was actually not recorded because every patient with chills and aches was just assumed to have a fever, so their temperature was never taken. If true, we could easily predict that every missing observation with chills and aches has a fever as well, and use this information to impute our missing data. In practice, this is more of a spectrum. Maybe if a patient had both chills and aches they were more likely to have a fever as well if they didn’t have their temperature taken, but not always. This is still predictable even if it isn’t perfectly predictable. This is a common type of missing data.

  3. Missing not at Random (MNAR). Sometimes, this is also called Not Missing at Random (NMAR). This assumes that the probability of a value being missing is NOT systematic or predictable using the other information we have but also isn’t missing randomly. In this situation data is missing for unknown reasons or for reasons you don’t have any information about. For example, in our dataset maybe information on age is missing because some very elderly patients either don’t know or refuse to say how old they are. In this situation, missing data on age is related to the value itself (and thus isn’t random) and isn’t predictable based on the other information we have. MNAR is complex and often the best way of dealing with this is to try to collect more data or information about why the data is missing rather than attempt to impute it.

In general, imputing MCAR data is often fairly simple, while MNAR is very challenging if not impossible. Many of the common data imputation methods assume MAR.

Useful packages

Some useful packages for imputing missing data are Hmisc, missForest (which uses random forests to impute missing data), and mice (Multivariate Imputation by Chained Equations). For this section we’ll just use the mice package, which implements a variety of techniques. The maintainer of the mice package has published an online book about imputing missing data that goes into more detail (https://stefvanbuuren.name/fimd/).

Here is the code to load the mice package:

pacman::p_load(mice)

Mean Imputation

Sometimes if you are doing a simple analysis or you have strong reason to think you can assume MCAR, you can simply set missing numerical values to the mean of that variable. Perhaps we can assume that missing temperature measurements in our dataset were either MCAR or were just normal values. Here is the code to create a new variable that replaces missing temperature values with the mean temperature value in our dataset. However, in many situations replacing data with the mean can lead to bias, so be careful.

linelist <- linelist %>%
  mutate(temp_replace_na_with_mean = replace_na(temp, mean(temp, na.rm = T)))

You could also do a similar process for replacing categorical data with a specific value. For our dataset, imagine you knew that all observations with a missing value for their outcome (which can be “Death” or “Recover”) were actually people that died (note: this is not actually true for this dataset):

linelist <- linelist %>%
  mutate(outcome_replace_na_with_death = replace_na(outcome, "Death"))

Regression imputation

A somewhat more advanced method is to use a statistical model to predict what a missing value is likely to be, and replace it with the predicted value. Here is an example of creating predicted values for all the observations where temperature is missing, but age and fever are not, using a simple linear regression with fever status and age in years as predictors. In practice you’d want to use a better model than this simple approach.

simple_temperature_model_fit <- lm(temp ~ fever + age_years, data = linelist)

#using our simple temperature model to predict values just for the observations where temp is missing
predictions_for_missing_temps <- predict(simple_temperature_model_fit,
                                        newdata = linelist %>% filter(is.na(temp))) 
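One way to slot these predictions back into the missing positions is sketched below (this is one approach, not the only one). It assumes the rows of predictions_for_missing_temps align with the rows of linelist where temp is NA, which is how they were created above; note that a prediction will itself be NA if fever or age_years is also missing for that row.

linelist_temp_imputed <- linelist                   # copy the data
missing_rows <- is.na(linelist_temp_imputed$temp)   # locate rows missing temp
linelist_temp_imputed$temp[missing_rows] <- predictions_for_missing_temps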

Or, using the same modeling approach through the mice package to create imputed values for the missing temperature observations:

model_dataset <- linelist %>%
  select(temp, fever, age_years)  

temp_imputed <- mice(model_dataset,
                            method = "norm.predict",
                            seed = 1,
                            m = 1,
                            print = F)
## Warning: Number of logged events: 1
temp_imputed_values <- temp_imputed$imp$temp

This is the same type of approach used by some more advanced methods, such as the missForest package, to replace missing data with predicted values. In that case, the prediction model is a random forest instead of a linear regression. You can use other types of models to do this as well. However, while this approach works well under MCAR, you should be a bit careful if you believe MAR or MNAR more accurately describes your situation. The quality of your imputation will depend on how good your prediction model is, and even with a very good model the variability of your imputed data may be underestimated.

LOCF and BOCF

Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are imputation methods for time series/longitudinal data. The idea is to take the previous observed value as a replacement for the missing data. When multiple values are missing in succession, the method searches for the last observed value.

[TO BE COMPLETED]
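In the meantime, here is a minimal sketch of LOCF using fill() from the tidyr package, on hypothetical longitudinal data:

```r
library(tidyr)

# hypothetical longitudinal data: one patient's weight across four visits
visits <- data.frame(
  visit  = 1:4,
  weight = c(70, NA, NA, 66)
)

# LOCF: carry the last observed value down into the gaps
visits_locf <- fill(visits, weight, .direction = "down")
```

For BOCF you would instead copy the baseline (first) observed value into every later missing slot. Use these methods cautiously, as they can introduce bias when values genuinely change over time.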

Multiple Imputation

The online book we mentioned earlier by the author of the mice package (https://stefvanbuuren.name/fimd/) contains a detailed explanation of multiple imputation and why you’d want to use it. But, here is a basic explanation of the method:

When you do multiple imputation, you create multiple datasets with the missing values imputed to plausible data values (depending on your research data you might want to create more or fewer of these imputed datasets; the mice package sets the default number to 5). The difference is that, rather than a single specific value, each imputed value is drawn from an estimated distribution (so it includes some randomness). As a result, each of these datasets will have slightly different imputed values (the non-missing data, however, will be identical across datasets). You still use some sort of predictive model to do the imputation in each of these new datasets (mice has many options for prediction methods, including Predictive Mean Matching, logistic regression, and random forest), but the mice package can take care of many of the modeling details.

Then, once you have created these new imputed datasets, you can apply whatever statistical model or analysis you were planning to do to each of them, and pool the results of these models together. This works very well to reduce bias in both MCAR and many MAR settings, and often results in more accurate standard error estimates.

Here is an example of applying the Multiple Imputation process to predict temperature in our linelist dataset using age and fever status (our simplified model_dataset from above): [Note from Daniel: this is not a very good model example and I’ll change it later]

# imputing missing values for all variables in our model_dataset, and creating 10 new imputed datasets
multiple_imputation = mice(
  model_dataset,
  seed = 1,
  m = 10,
  print = FALSE) 
## Warning: Number of logged events: 1
model_fit <- with(multiple_imputation, lm(temp ~ age_years + fever))

base::summary(mice::pool(model_fit))
##          term estimate std.error statistic    df p.value
## 1 (Intercept) 37.01487   0.02173  1703.229  70.3   0.000
## 2   age_years  0.00047   0.00060     0.778 232.3   0.437
## 3    feveryes  1.99195   0.01963   101.496 143.1   0.000

Here we used the mice default method of imputation, which is Predictive Mean Matching. We then used these imputed datasets to separately estimate, and then pool, results from simple linear regressions on each of these datasets. There are many details we have glossed over and many settings you can adjust during the Multiple Imputation process while using the mice package. For example, you will not always have numerical data and might need to use other imputation methods (the mice package supports many other types of data and methods). But for a more robust analysis when missing data is a significant concern, Multiple Imputation is a good solution that is not always much more work than doing a complete case analysis.

Resources

Vignette on the naniar package

Gallery of missing value visualizations

Descriptive analysis

This page demonstrates the use of base R, dplyr, and gtsummary to produce tabulations and descriptive statistics, and to conduct simple statistical tests. Each of these tools has advantages and disadvantages in terms of code simplicity, accessibility of outputs, and quality of printed outputs. We hope one of these approaches will work for you.

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(
  rio,          # File import
  here,         # File locator
  skimr,        # get overview of data
  tidyverse,    # data management + ggplot2 graphics, 
  gtsummary,    # summary statistics and tests
  janitor,      # adding totals and percents to tables
  flextable,    # converting tables to HTML
  corrr         # correlation analysis for numeric variables
  )

Load data

The example dataset used in this section is a linelist of individual cases from a simulated epidemic.

The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are displayed below.

Browse data

skimr package

Using the skimr package you can get a detailed and aesthetically pleasing overview of each of the variables in your dataset. Read more about skimr at its GitHub page.

Below, the function skim() is applied to the entire linelist data frame. An overview of the data frame and a summary of every column (by class) is produced.

## get information about each variable in a dataset 
skim(linelist)
Table 2: Data summary
Name linelist
Number of rows 5888
Number of columns 30
_______________________
Column type frequency:
character 13
Date 4
factor 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 0 1.00 6 6 0 5888 0
outcome 1323 0.78 5 7 0 2 0
gender 284 0.95 1 1 0 2 0
age_unit 0 1.00 5 6 0 2 0
hospital 0 1.00 5 36 0 6 0
infector 2088 0.65 6 6 0 2697 0
source 2088 0.65 5 7 0 2 0
fever 246 0.96 2 3 0 2 0
chills 246 0.96 2 3 0 2 0
cough 246 0.96 2 3 0 2 0
aches 246 0.96 2 3 0 2 0
vomit 246 0.96 2 3 0 2 0
time_admission 744 0.87 5 5 0 1069 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_infection 2087 0.65 2014-03-19 2015-04-27 2014-10-11 359
date_onset 0 1.00 2014-04-07 2015-04-30 2014-10-21 367
date_hospitalisation 0 1.00 2014-04-17 2015-04-30 2014-10-23 363
date_outcome 936 0.84 2014-04-19 2015-06-04 2014-11-01 371

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_cat 87 0.99 FALSE 8 5-9: 1103, 20-: 1102, 0-4: 1066, 10-: 918
age_cat5 87 0.99 FALSE 18 5-9: 1103, 0-4: 1066, 10-: 918, 15-: 773

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
generation 0 1.00 16.56 5.79 0.00 13.00 16.00 20.00 37.00
age 87 0.99 16.20 12.71 0.00 6.00 13.00 23.00 90.00
age_years 87 0.99 16.14 12.73 0.00 6.00 13.00 23.00 90.00
lon 0 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21
lat 0 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49
wt_kg 0 1.00 53.15 18.49 -8.00 42.00 55.00 66.00 121.00
ht_cm 0 1.00 124.80 49.37 6.00 89.00 130.00 158.00 335.00
ct_blood 0 1.00 21.19 1.67 16.00 20.00 22.00 22.00 26.00
temp 135 0.98 38.54 1.00 35.20 38.10 38.80 39.20 40.90
bmi 0 1.00 48.11 50.55 -416.67 24.81 32.86 51.02 1234.57
days_onset_hosp 0 1.00 2.01 2.22 0.00 1.00 1.00 3.00 22.00

Base R

You can also use the summary() function, from base R, to get information about an entire dataset. Provide the name of the dataset to summary() and it will return an overview of each column. The values returned will depend on the class of each column. However, this output can be more difficult to read than the output from skimr.

## get information about each variable in a dataset 
summary(linelist)

Descriptive tables

You have several choices when producing tabulations, cross-tabulations, and statistical summaries. Some of the factors to consider include code simplicity and ease, where the output appears (R console, or Viewer pane), and what you can do with the data afterward. All of the options below have strengths and weaknesses; each can create simple or complex tables. Consider the bullets below when deciding which tool to use.

  • Use table() and summary() from base R to quickly view tables and statistics in the console
  • Use count() and summarise() from dplyr within the context of a pipe chain or if preparing data for ggplot()
  • Use tbl_summary() from gtsummary to produce detailed publication-ready tables

base R

Statistical functions

To print summary statistics on a numeric column, base R functions can be the easiest and fastest to use. These functions are also often used within more complex code operations, for example if grouping and summarising columns, or referencing a max() value to calibrate plot height.

See the R Basics page for a complete list of mathematical operators such as max(), min(), median(), mean(), quantile(), sd(), and range().

CAUTION: If your data contain missing values, R wants you to know this and so will return NA, unless you tell the above mathematical functions to ignore missing values via the argument na.rm = TRUE.
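For example, a minimal illustration with made-up numbers:

```r
ages <- c(2, 4, NA, 6)

mean(ages)               # returns NA - the missing value propagates
mean(ages, na.rm = TRUE) # returns 4 - the NA is ignored
```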

You can return most of the important summary statistics for a numeric column using summary(), as below. Note that the column must be specified along with the dataset (e.g. linelist$age_years).

summary(linelist$age_years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.14   23.00   90.00      87

Tables

Use the function table() to print counts of each unique value to the R console. You must specify the dataframe and the column, as shown below.

CAUTION: NA (missing) values will not be tabulated unless you include the argument useNA = "always" (which could also be set to “no” or “ifany”).

table(linelist$outcome, useNA = "always")
## 
##   Death Recover    <NA> 
##    2582    1983    1323

Two columns (or even three!) can be cross-tabulated by listing them one after the other, separated by commas. Optionally, you can assign each column a “name” like Outcome = linelist$outcome to help distinguish them in the printed table. This is how you can create a classic epidemiological 2x2 table.

age_by_outcome <- table(linelist$age_cat, linelist$outcome, useNA = "always") # save table as object
age_by_outcome   # print table
##        
##         Death Recover <NA>
##   0-4     482     361  223
##   5-9     526     329  248
##   10-14   384     307  227
##   15-19   326     268  179
##   20-29   469     392  241
##   30-49   316     252  156
##   50-69    49      31   22
##   70+       6       5    2
##   <NA>     24      38   25

You can return proportions instead by passing the above table to the function prop.table(), as shown below. Use the margin = argument to specify whether you want the proportions to be of rows (1) or of columns (2); omit the argument for proportions of the whole table. For clarity, we pipe the table to the round() function from base R, specifying 2 digits.

# get proportions of table defined above, by rows, rounded
prop.table(age_by_outcome, 1) %>% round(2)
##        
##         Death Recover <NA>
##   0-4    0.45    0.34 0.21
##   5-9    0.48    0.30 0.22
##   10-14  0.42    0.33 0.25
##   15-19  0.42    0.35 0.23
##   20-29  0.43    0.36 0.22
##   30-49  0.44    0.35 0.22
##   50-69  0.48    0.30 0.22
##   70+    0.46    0.38 0.15
##   <NA>   0.28    0.44 0.29

To add row and column totals, pass the table to addmargins(). This works for both counts and proportions.

addmargins(age_by_outcome)
##        
##         Death Recover <NA>  Sum
##   0-4     482     361  223 1066
##   5-9     526     329  248 1103
##   10-14   384     307  227  918
##   15-19   326     268  179  773
##   20-29   469     392  241 1102
##   30-49   316     252  156  724
##   50-69    49      31   22  102
##   70+       6       5    2   13
##   <NA>     24      38   25   87
##   Sum    2582    1983 1323 5888

Converting a table() object like the one above directly to a data frame is surprisingly not straightforward. You may want to convert it to a data frame to export it, to apply further changes, or to print it nicely as an HTML table. One approach for this conversion is demonstrated below:

  1. Create the table without using useNA = "always"; instead, convert the NA values to “(Missing)” with fct_explicit_na() from the forcats package. This is important for steps 3 and 4.
  2. Add totals (optional) by piping to addmargins()
  3. Pipe to the base R function as.data.frame.matrix()
  4. Pipe the table to the dplyr function add_rownames(), specifying the name for the first column
  5. Print, View, or export as desired. In this example we use flextable() from package flextable as described in the Tables page. This will print to the RStudio viewer pane as a pretty HTML.
table(fct_explicit_na(linelist$age_cat), fct_explicit_na(linelist$outcome)) %>% 
  addmargins() %>% 
  as.data.frame.matrix() %>% 
  add_rownames(var = "Age Category") %>% 
  flextable()

Below is an alternative method for adding totals and percents. The totals and formatting of counts and percents are added after conversion to class data frame, because the adorn_xxx() functions from janitor only work on a data frame.

table(fct_explicit_na(linelist$age_cat), fct_explicit_na(linelist$outcome)) %>% 
  as.data.frame.matrix() %>% 
  add_rownames(var = "Age Category") %>% 
  adorn_totals() %>%
  adorn_percentages(denominator = "row") %>% 
  adorn_pct_formatting() %>%
  adorn_ns(position = "front") %>% 
  flextable() %>% autofit()

gtsummary package

If you want to print your summary statistics in a pretty, publication-ready graphic, you can use the gtsummary package and its function tbl_summary(). The code can seem complex at first, but the outputs look very nice and print to your RStudio Viewer panel as HTML. Read a vignette here.

To introduce tbl_summary() we will show the most basic behavior first, which actually produces a large and beautiful table. Then, we will examine in detail how to make adjustments and more tailored tables.

Summary table

The default behavior of tbl_summary() is quite incredible - it takes the columns you provide and creates a summary table. The function prints statistics appropriate to the column class: median and inter-quartile range (IQR) for numeric columns, and counts (%) for categorical or binary columns. Missing values are converted to “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top.

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>%  # keep columns of interest
  tbl_summary()                                                  # default tbl_summary()
Characteristic N = 5,8881
age_years 13 (6, 23)
Unknown 87
gender
f 2,815 (50%)
m 2,789 (50%)
Unknown 284
outcome
Death 2,582 (57%)
Recover 1,983 (43%)
Unknown 1,323
fever 4,492 (80%)
Unknown 246
temp 38.80 (38.10, 39.20)
Unknown 135
hospital
Central Hospital 454 (7.7%)
Military Hospital 896 (15%)
Missing 1,469 (25%)
Other 885 (15%)
Port Hospital 1,762 (30%)
St. Mark's Maternity Hospital (SMMH) 422 (7.2%)

1 Median (IQR); n (%)

Now we will explain how the function works and how to make adjustments. The key arguments are detailed below:

by =
You can stratify your table by a column (e.g. by outcome), creating a 2-way table.

statistic =
Indicate which statistics to show and how to display them with an equation. There are two sides to the equation, separated by a tilde ~. On the right in quotes is the statistical display desired, and on the left are the columns to which that display will apply.

  • The right side of the equation uses the syntax of str_glue() from stringr (see Characters and Strings), with the desired display string in quotes and the statistics themselves within curly brackets. You can include statistics like “n” (for counts), “N” (for denominator), “mean”, “median”, “sd”, “max”, “min”, percentiles as “p##” like “p25”, or percent of total as “p”. See ?tbl_summary for details.
  • For the left side of the equation, you can specify columns by name (e.g. age or c(age, gender)) or using helpers such as all_continuous(), all_categorical(), contains(), starts_with(), etc.

A simple example of a statistic = equation might look like below, to only print the mean of column age_years:

linelist %>% 
  select(age_years) %>%         # keep only columns of interest 
  tbl_summary(                  # create summary table
    statistic = age_years ~ "{mean}") # print mean of age
Characteristic N = 5,8881
age_years 16
Unknown 87

1 Mean

A slightly more complex equation might look like this, incorporating the max and min values within parentheses and separated by a comma:

statistic = age_years ~ "({min}, {max})"

You can also differentiate syntax for separate columns or types of columns. In the more complex example below, the value provided to statistic = is a list indicating that for all continuous columns the table should print mean with standard deviation in parentheses, while for all categorical columns it should print the n, denominator, and percent.

digits =
Adjust the digits and rounding. Optionally, this can be specified to be for continuous columns only (as below).

label =
Adjust how the column name should be displayed. Provide the column name and its desired label separated by a tilde. The default is the column name.

missing_text =
Adjust how missing values are displayed. The default is “Unknown”.

type =
This is used to adjust how many levels of the statistics are shown. The syntax is similar to statistic = in that you provide an equation with columns on the left and a value on the right. Two common scenarios include:

  • type = all_categorical() ~ "categorical" Forces dichotomous columns (e.g. fever) to show all levels instead of only the “yes” row
  • type = all_continuous() ~ "continuous2" Allows multi-line statistics per variable, as shown in a later section

In the example below, each of these arguments is used to modify the original summary table:

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% # keep only columns of interest
  tbl_summary(     
    by = outcome,                                               # stratify entire table by outcome
    statistic = list(all_continuous() ~ "{mean} ({sd})",        # stats and format for continuous columns
                     all_categorical() ~ "{n} / {N} ({p}%)"),   # stats and format for categorical columns
    digits = all_continuous() ~ 1,                              # rounding for continuous columns
    type   = all_categorical() ~ "categorical",                 # force all categorical levels to display
    label  = list(                                              # display labels for column names
      outcome   ~ "Outcome",                           
      age_years ~ "Age (years)",
      gender    ~ "Gender",
      temp      ~ "Temperature",
      hospital  ~ "Hospital"),
    missing_text = "Missing"                                    # how missing values should display
  )
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9831
Age (years) 16.0 (13.0) 16.4 (12.7)
Missing 24 38
Gender
f 1,246 / 2,477 (50%) 946 / 1,879 (50%)
m 1,231 / 2,477 (50%) 933 / 1,879 (50%)
Missing 105 104
fever
no 513 / 2,479 (21%) 368 / 1,892 (19%)
yes 1,966 / 2,479 (79%) 1,524 / 1,892 (81%)
Missing 103 91
Temperature 38.5 (1.0) 38.6 (1.0)
Missing 55 52
Hospital
Central Hospital 193 / 2,582 (7.5%) 165 / 1,983 (8.3%)
Military Hospital 399 / 2,582 (15%) 309 / 1,983 (16%)
Missing 611 / 2,582 (24%) 514 / 1,983 (26%)
Other 395 / 2,582 (15%) 290 / 1,983 (15%)
Port Hospital 785 / 2,582 (30%) 579 / 1,983 (29%)
St. Mark's Maternity Hospital (SMMH) 199 / 2,582 (7.7%) 126 / 1,983 (6.4%)

1 Mean (SD); n / N (%)

Multi-line stats for continuous variables

If you want to print multiple lines of statistics for continuous variables, set type = to “continuous2”. You can then combine any of the previously shown elements in one table by choosing which statistics to show. The number of missing values is shown as “Unknown”.

linelist %>% 
  select(age_years, temp) %>%                      # keep only columns of interest
  tbl_summary(                                     # create summary table
    type = all_continuous() ~ "continuous2",       # indicate that you want to print multiple statistics 
    statistic = all_continuous() ~ c(
      "{mean} ({sd})",                             # line 1: mean and SD
      "{median} ({p25}, {p75})",                   # line 2: median and IQR
      "{min}, {max}")                              # line 3: min and max
    )
Characteristic N = 5,888
age_years
Mean (SD) 16 (13)
Median (IQR) 13 (6, 23)
Range 0, 90
Unknown 87
temp
Mean (SD) 38.54 (1.00)
Median (IQR) 38.80 (38.10, 39.20)
Range 35.20, 40.90
Unknown 135

There are many other ways to modify these tables, including adding p-values, adjusting color and headings, etc. Many of these are described in the documentation (enter ?tbl_summary in Console), and some are given in the section on statistical tests.

dplyr package

Creating cross-tabulations with dplyr is less straightforward than with the tools above, as the output remains a “long” data frame rather than a classic cross-table. However, this approach to tabulation is useful if you are working within a longer pipe chain, or if you want to pass the results to ggplot() (which expects “long” data). See the Cleaning data and core functions page for an example of a pipe chain.

Use the dplyr function count() to return tabulated counts. This function, as applied to grouped data, is described in depth in the Grouping data page. The output is returned in a “long” format, with a column n holding the counts.

linelist %>% 
  count(age_cat)
##   age_cat    n
## 1     0-4 1066
## 2     5-9 1103
## 3   10-14  918
## 4   15-19  773
## 5   20-29 1102
## 6   30-49  724
## 7   50-69  102
## 8     70+   13
## 9    <NA>   87

You can cross-tabulate two or more columns by adding them within the count() function. Note the format is different than for table() - it is “long” in that each unique combination of the two columns is listed, with the counts in the n column. Also note that missing values are considered in the unique combinations.

linelist %>% 
  count(age_cat, gender)
##    age_cat gender   n
## 1      0-4      f 624
## 2      0-4      m 404
## 3      0-4   <NA>  38
## 4      5-9      f 651
## 5      5-9      m 414
## 6      5-9   <NA>  38
## 7    10-14      f 555
## 8    10-14      m 334
## 9    10-14   <NA>  29
## 10   15-19      f 381
## 11   15-19      m 367
## 12   15-19   <NA>  25
## 13   20-29      f 440
## 14   20-29      m 626
## 15   20-29   <NA>  36
## 16   30-49      f 161
## 17   30-49      m 539
## 18   30-49   <NA>  24
## 19   50-69      f   3
## 20   50-69      m  93
## 21   50-69   <NA>   6
## 22     70+      m  12
## 23     70+   <NA>   1
## 24    <NA>   <NA>  87

Piping this output to ggplot() is relatively straightforward. See further examples in the pages [Plotting categorical data] and ggplot tips.

linelist %>% 
  count(outcome, age_cat) %>% 
  ggplot()+
    geom_bar(aes(x = outcome, fill = age_cat, y = n), stat = "identity")

Add proportions

To add proportions or percents in a new column, use mutate() on the counted data frame as below. Note that the data remain in “long” format (not like table() above).

linelist %>% 
  count(outcome) %>%                     # counts by outcome 
  mutate(percentage = n / sum(n) * 100)  # calculate proportion
##   outcome    n        percentage
## 1   Death 2582   43.85190
## 2 Recover 1983   33.67867
## 3    <NA> 1323   22.46943

You can calculate proportions within groups by having two levels of aggregation prior to using mutate(). The below table first groups the data frame by outcome and then groups/counts by age_cat, achieving the breakdown of age by outcome. Note that you can add more stratifications by adding columns to the group_by() command.

linelist %>% 
  group_by(outcome) %>%                  # group first by outcome 
  count(age_cat) %>%                     # group again and count by age_cat (produces n column)
  mutate(percentage = n / sum(n) * 100)  # calculate proportion - note the denominator is by outcome group
## # A tibble: 27 x 4
## # Groups:   outcome [3]
##    outcome age_cat     n percentage
##    <chr>   <fct>   <int>      <dbl>
##  1 Death   0-4       482     18.7  
##  2 Death   5-9       526     20.4  
##  3 Death   10-14     384     14.9  
##  4 Death   15-19     326     12.6  
##  5 Death   20-29     469     18.2  
##  6 Death   30-49     316     12.2  
##  7 Death   50-69      49      1.90 
##  8 Death   70+         6      0.232
##  9 Death   <NA>       24      0.930
## 10 Recover 0-4       361     18.2  
## # ... with 17 more rows

Note that it is possible to change the above table to wide format, making it more like a classic cross-tabulation, using the tidyr pivot_wider() function. This would be done by adding this to the end of the code: pivot_wider(names_from = age_cat, values_from = c(n, percentage)). For more information see the page on Pivoting data.
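As a small self-contained illustration of that reshaping step (with made-up counts, not the linelist):

```r
library(dplyr)
library(tidyr)

# long-format counts, as produced by count()
long_counts <- data.frame(
  outcome = c("Death", "Death", "Recover", "Recover"),
  age_cat = c("0-4", "5-9", "0-4", "5-9"),
  n       = c(482, 526, 361, 329)
)

# reshape so each age category becomes its own column
wide_counts <- long_counts %>%
  pivot_wider(names_from = age_cat, values_from = n)
```

The result has one row per outcome and one column per age category, resembling a classic cross-tabulation.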

If you want to display a table produced using count(), you can add totals, percents, and proportions using the package janitor. See a detailed example in the Grouping data page, and a brief example below:

pacman::p_load(janitor)

linelist %>% 
  count(outcome) %>%              # produce the counts by unique outcome
  adorn_totals(where = "row") %>% # add total row
  adorn_percentages("col") %>%    # add proportion by column
  adorn_pct_formatting() %>%      # proportion converted to percent
  adorn_ns(position = "front")    # Add the underlying N, in front of the percentage
##  outcome             n
##    Death 2582  (43.9%)
##  Recover 1983  (33.7%)
##     <NA> 1323  (22.5%)
##    Total 5888 (100.0%)

summarise()

You can also use dplyr to create a table with different summary statistics, for example mean, median, range, standard deviation and percentiles. You can also show these all in one table. This is discussed in detail in the page on Grouping data.

Note the argument na.rm = TRUE, which removes missing values from the calculation. If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  summarise(mean = mean(age_years, na.rm = TRUE)) # get the mean value of age while excluding missings
##       mean
## 1 16.14426

Instead of mean, you can also use other base R statistical functions like median(), max(), sd(), etc. To return percentiles, use quantile() with the defaults or specify the value(s) you would like.

# get default percentile values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(percentiles = quantile(age_years, na.rm = TRUE))
##   percentiles
## 1           0
## 2           6
## 3          13
## 4          23
## 5          90
# get specified percentile values of age (5%, 50%, 75%, 98%)
linelist %>% 
  summarise(percentiles = quantile(age_years,
                                   probs = c(.05, 0.5, 0.75, 0.98), 
                                   na.rm=TRUE))
##   percentiles
## 1           1
## 2          13
## 3          23
## 4          49

You can combine all of the previously shown statistical functions in one summary table. One nuance is that to display the quantiles and range in one cell (separated by commas) you will need to use the str_c function from stringr. See the page on Characters and strings for more details.

linelist %>% 
  summarise(
    mean   = mean(age_years, na.rm = TRUE),   # mean
    SD     = sd(age_years, na.rm = TRUE),     # standard deviation
    median = median(age_years, na.rm = TRUE), # median 
    IQR = str_c(                              # IQR, elements separated by a comma
      quantile(age_years, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      ), 
    range = str_c(                            # range, elements separated by a comma
      range(age_years, na.rm = TRUE), 
      collapse = ", "
    )
  )
##       mean       SD median   IQR range
## 1 16.14426 12.73159     13 6, 23 0, 90

Lastly, another option is to use the tabyl() function from the janitor package.

Statistical tests

base R

You can use base R functions to produce the results of statistical tests. The commands are relatively simple and results will print to the R Console for simple viewing. However, the outputs are usually lists and so are harder to manipulate if you want to use the results in subsequent code operations.

T-tests

Syntax 1: Best if your numeric and categorical columns are in the same data frame. Provide the numeric column on the left side of the equation and the categorical column on the right side. Specify the dataset to data =. Optionally, set paired = TRUE, conf.level = (0.95 by default), and alternative = (either “two.sided”, “less”, or “greater”). Enter ?t.test for more details.

## compare mean age by outcome group with a t-test
t.test(age_years ~ outcome, data = linelist)
## 
##  Welch Two Sample t-test
## 
## data:  age_years by outcome
## t = -1.1252, df = 4231.1, p-value = 0.2606
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1878646  0.3215694
## sample estimates:
##   mean in group Death mean in group Recover 
##              16.00498              16.43813

Syntax 2: You can compare two separate numeric vectors using this alternative syntax. For example, if the two columns are in different data sets.

t.test(df1$age_years, df2$age_years)

Conduct a one-sample t-test by providing the known/hypothesized population mean via the mu = argument:

t.test(linelist$age_years, mu = 45)

Shapiro-Wilk’s test

Use shapiro.test() to assess whether a numeric sample is consistent with a normal distribution (note that it accepts between 3 and 5000 non-missing values):

shapiro.test(linelist$age_years)

Wilcoxon rank sum test

## compare age distribution by outcome group with a wilcox test
wilcox.test(age_years ~ outcome, data = linelist)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age_years by outcome
## W = 2412546, p-value = 0.08206
## alternative hypothesis: true location shift is not equal to 0

Kruskal-Wallis test

## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age_years ~ outcome, linelist)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  age_years by outcome
## Kruskal-Wallis chi-squared = 3.0237, df = 1, p-value = 0.08206

Chi-squared test

## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  linelist$gender and linelist$outcome
## X-squared = 0, df = 1, p-value = 1

gtsummary package

Use gtsummary if you are looking to add the results of a statistical test to a pretty table (as described in the section above). Perform statistical tests of comparison with tbl_summary() by piping the table to add_p() and specifying which test to use. P-values corrected for multiple testing can be obtained with add_q(). Run ?tbl_summary for details.

Chi-squared test

Compare the proportions of a categorical variable in two groups. The default statistical test for add_p() is a chi-squared test of independence with continuity correction, but if any expected cell count is below 5 then a Fisher’s exact test is used.

linelist %>% 
  select(gender, outcome) %>%    # keep variables of interest
  tbl_summary(by = outcome) %>%  # produce summary table and specify grouping variable
  add_p()                        # specify what test to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9831 p-value2
gender >0.9
f 1,246 (50%) 946 (50%)
m 1,231 (50%) 933 (50%)
Unknown 105 104

1 n (%)

2 Pearson's Chi-squared test

T-tests

Compare the difference in means for a continuous variable in two groups. For example, compare the mean age by patient outcome.

linelist %>% 
  select(age_years, outcome) %>%             # keep variables of interest
  tbl_summary(                               # produce summary table
    statistic = age_years ~ "{mean} ({sd})", # specify what statistics to show
    by = outcome) %>%                        # specify the grouping variable
  add_p(age_years ~ "t.test")                # specify what tests to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9831 p-value2
age_years 16 (13) 16 (13) 0.3
Unknown 24 38

1 Mean (SD)

2 Welch Two Sample t-test

Wilcoxon rank sum test

Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However, when comparing more than two groups, the Kruskal-Wallis test is more appropriate.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (this is default so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "wilcox.test")                     # specify what test to perform (default so could leave brackets empty)
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9831 p-value2
age_years 13 (6, 23) 14 (6, 23) 0.082
Unknown 24 38

1 Median (IQR)

2 Wilcoxon rank sum test

Kruskal-Wallis test

Compare the distribution of a continuous variable in two or more groups, regardless of whether the data are normally distributed.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (default, so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "kruskal.test")                    # specify what test to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9831 p-value2
age_years 13 (6, 23) 14 (6, 23) 0.082
Unknown 24 38

1 Median (IQR)

2 Kruskal-Wallis rank sum test

dplyr package

Performing statistical tests in dplyr alone is very dense, because such tests do not fit neatly within the tidy-data framework. It requires using purrr to create a list of data frames for each of the subgroups you want to compare. See the page on Iteration and loops to learn about purrr.

An easier alternative may be the rstatix package.
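As a minimal sketch (using a small made-up data frame, and assuming the rstatix package is installed): rstatix functions accept formula syntax on data frames and return tidy tibbles, which fit directly into pipelines.

```r
library(rstatix)

## small hypothetical dataset (not the handbook linelist)
df <- data.frame(
  outcome = rep(c("Death", "Recover"), each = 30),
  age     = c(rnorm(30, mean = 16, sd = 12), rnorm(30, mean = 16, sd = 12))
)

t_test(df, age ~ outcome)       # Welch two-sample t-test, tidy output
wilcox_test(df, age ~ outcome)  # Wilcoxon rank sum equivalent
```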

T-tests

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread into wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the mean age for the death group
    Death_mean = map(Death, ~mean(.x$age, na.rm = TRUE)),
    ## calculate the sd among dead 
    Death_sd = map(Death, ~sd(.x$age, na.rm = TRUE)),
    ## calculate the mean age for the recover group
    Recover_mean = map(Recover, ~mean(.x$age, na.rm = TRUE)), 
    ## calculate the sd among recovered 
    Recover_sd = map(Recover, ~sd(.x$age, na.rm = TRUE)),
    ## using both grouped data sets compare mean age with a t-test
    ## keep only the p.value
    t_test = map2(Death, Recover, ~t.test(.x$age, .y$age)$p.value)
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_mean Death_sd Recover_mean Recover_sd t_test
##        <dbl>    <dbl>        <dbl>      <dbl>  <dbl>
## 1       16.1     12.9         16.5       12.7  0.251

Wilcoxon rank sum test

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread into wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the median age for the death group
    Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
    ## calculate the 25th and 75th percentiles (IQR) among dead
    Death_iqr = map(Death, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## calculate the median age for the recover group
    Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)), 
    ## calculate the 25th and 75th percentiles (IQR) among recovered
    Recover_iqr = map(Recover, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## using both grouped data sets compare age distribution with a wilcox test
    ## keep only the p.value
    wilcox = map2(Death, Recover, ~wilcox.test(.x$age, .y$age)$p.value)
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_median Death_iqr Recover_median Recover_iqr wilcox
##          <dbl> <chr>              <dbl> <chr>        <dbl>
## 1           13 6, 23                 14 6, 24       0.0798

Kruskal-wallis test

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread into wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the median age for the death group
    Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
    ## calculate the 25th and 75th percentiles (IQR) among dead
    Death_iqr = map(Death, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## calculate the median age for the recover group
    Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)), 
    ## calculate the 25th and 75th percentiles (IQR) among recovered
    Recover_iqr = map(Recover, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## using the original data set compare age distribution with a kruskal test
    ## keep only the p.value
    kruskal = kruskal.test(linelist$age, linelist$outcome)$p.value
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_median Death_iqr Recover_median Recover_iqr kruskal
##          <dbl> <chr>              <dbl> <chr>         <dbl>
## 1           13 6, 23                 14 6, 24        0.0798

Chi-squared test

linelist %>% 
  ## group the data by outcome
  group_by(outcome) %>% 
  ## count the variable of interest
  count(gender) %>% 
  ## calculate proportion 
  ## note that the denominator here is the total within each outcome group
  mutate(percentage = n / sum(n) * 100) %>% 
  pivot_wider(names_from = outcome, values_from = c(n, percentage)) %>% 
  filter(!is.na(gender)) %>% 
  mutate(pval = chisq.test(linelist$gender, linelist$outcome)$p.value)
## # A tibble: 2 x 8
##   gender n_Death n_Recover  n_NA percentage_Death percentage_Recover percentage_NA  pval
##   <chr>    <int>     <int> <int>            <dbl>              <dbl>         <dbl> <dbl>
## 1 f         1246       946   623             48.3               47.7          47.1     1
## 2 m         1231       933   625             47.7               47.0          47.2     1

Correlations

Correlation between numeric variables can be investigated using the tidyverse corrr package. It allows you to compute correlations using Pearson, Kendall tau, or Spearman rho. The package creates a table and also has a function to automatically plot the values.

correlation_tab <- linelist %>% 
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>%   # keep numeric variables of interest
  correlate()      # create correlation table (using default pearson)

correlation_tab    # print
## # A tibble: 6 x 7
##   term            generation       age ct_blood days_onset_hosp    wt_kg    ht_cm
##   <chr>                <dbl>     <dbl>    <dbl>           <dbl>    <dbl>    <dbl>
## 1 generation       NA         0.000371   0.195          -0.275   0.00715  0.00486
## 2 age               0.000371 NA          0.0150         -0.0139  0.832    0.877  
## 3 ct_blood          0.195     0.0150    NA              -0.601   0.0193   0.0226 
## 4 days_onset_hosp  -0.275    -0.0139    -0.601          NA      -0.0210  -0.0266 
## 5 wt_kg             0.00715   0.832      0.0193         -0.0210 NA        0.876  
## 6 ht_cm             0.00486   0.877      0.0226         -0.0266  0.876   NA
## remove duplicate entries (the table above is mirrored) 
correlation_tab <- correlation_tab %>% 
  shave()

## view correlation table 
correlation_tab
## # A tibble: 6 x 7
##   term            generation     age ct_blood days_onset_hosp  wt_kg ht_cm
##   <chr>                <dbl>   <dbl>    <dbl>           <dbl>  <dbl> <dbl>
## 1 generation       NA        NA       NA              NA      NA        NA
## 2 age               0.000371 NA       NA              NA      NA        NA
## 3 ct_blood          0.195     0.0150  NA              NA      NA        NA
## 4 days_onset_hosp  -0.275    -0.0139  -0.601          NA      NA        NA
## 5 wt_kg             0.00715   0.832    0.0193         -0.0210 NA        NA
## 6 ht_cm             0.00486   0.877    0.0226         -0.0266  0.876    NA
## plot correlations 
rplot(correlation_tab)
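To use Kendall's tau or Spearman's rho instead of the default Pearson, pass the method = argument to correlate(), as in this sketch continuing the pipeline above:

```r
## the same workflow using Spearman's rho ("kendall" is also accepted)
linelist %>% 
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>% 
  correlate(method = "spearman") %>% 
  shave()     # remove the mirrored upper triangle
```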

Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary dplyr corrr sthda correlation

Univariate and multivariate regression

Overview

This page demonstrates the use of gtsummary and base R regression functions such as glm() to look at associations between variables (e.g. odds ratios, risk ratios and hazard ratios). It covers:

  1. Univariate: two-by-two tables
  2. Stratified: Mantel-Haenszel estimates
  3. Multivariable: variable selection, model selection, final table
  4. Forest plots

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(
  rio,          # File import
  here,         # File locator
  tidyverse,    # data management + ggplot2 graphics, 
  stringr,      # manipulate text strings 
  purrr,        # loop over objects in a tidy way
  gtsummary,    # summary statistics and tests 
  broom,        # tidy up results from regressions
  parameters,   # alternative to tidy up results from regressions
  see           # alternative to visualise forest plots
  )

Load data

The example dataset used in this section is a linelist of individual cases from a simulated epidemic. The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are displayed below.

Clean data

Store the explanatory variables

## define variables of interest 
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

Convert to 1’s and 0’s

Below we convert the explanatory columns from “yes”/“no”, “m”/“f”, and “Death”/“Recover” to 1 / 0, to cooperate with the expectations of logistic regression models. To do this efficiently, we use the vector of explanatory column names defined above.

We apply the function case_when() to convert specified values to 1’s and 0’s. This function is applied to all the explanatory_vars columns using across() (see the page on Grouping data).

## convert dichotomous variables to 0/1 
linelist <- linelist %>% 
  mutate(
    ## for each of the variables listed and "outcome"
    across(
      all_of(c(explanatory_vars, "outcome")), 
      ## recode male, yes and death to 1; female, no and recover to 0
      ## otherwise set to missing
           ~case_when(
             . %in% c("m", "yes", "Death")   ~ 1,
             . %in% c("f", "no",  "Recover") ~ 0, 
             TRUE                            ~ NA_real_
           ))
  )

Drop rows with missing values

To do this, we first add the column age_cat to explanatory_vars (including age_cat in the previous case_when() operation would have converted its values to missing). Then we pipe the linelist to drop_na() to remove any rows with missing values in the outcome column or any of the explanatory_vars columns.

## add age_cat to the explanatory vars 
explanatory_vars <- c(explanatory_vars, "age_cat")

## drop rows with missing information for variables of interest 
linelist <- linelist %>% 
  drop_na(any_of(c("outcome", explanatory_vars)))

The number of rows remaining in linelist is 4166.

Univariate

Just like in the page on Descriptive analysis, your use case will determine which R package you use. We present two options for doing univariate analysis:

  • Use functions available in base R to quickly print results to the console, accompanied by the broom package to tidy up the outputs.
  • Use the gtsummary package and its tbl_uvregression() function to produce publication-ready tables.

gtsummary package

Below we present the use of tbl_uvregression() from the gtsummary package. Just like in the page on Descriptive analysis, gtsummary functions do a good job of running statistics and producing professional-looking outputs. This function produces a table of univariate regression results.

In this case, we select only the necessary columns from the linelist and then pipe into tbl_uvregression(). We are going to run univariate regression on each of the columns we defined as explanatory_vars in the Preparation section (gender, fever, chills, cough, aches, vomit, and age_cat).

Within the function itself, we provide glm to method = (no quotes), the outcome column to y =, specify logistic regression via method.args = (it’s a list because you could pass multiple arguments), and tell it to exponentiate the results.

The output is HTML and contains the counts.

univ_tab <- linelist %>% 
  ## select variables of interest
  dplyr::select(all_of(explanatory_vars), outcome) %>% 
  ## produce univariate table
  tbl_uvregression(
    ## define regression want to run (generalised linear model)
    method = glm, 
    ## define outcome variable
    y = outcome, 
    ## define what type of glm want to run (logistic)
    method.args = list(family = binomial), 
    ## exponentiate the outputs to produce odds ratios (rather than log odds)
    exponentiate = TRUE
  )
## view univariate results table 
univ_tab
Characteristic N OR1 95% CI1 p-value
gender 4,166 1.00 0.89, 1.13 >0.9
fever 4,166 0.94 0.81, 1.10 0.5
chills 4,166 1.00 0.86, 1.16 >0.9
cough 4,166 1.09 0.91, 1.29 0.4
aches 4,166 1.09 0.89, 1.33 0.4
vomit 4,166 1.02 0.90, 1.15 0.7
age_cat 4,166
0-4
5-9 1.12 0.92, 1.37 0.3
10-14 0.87 0.71, 1.08 0.2
15-19 0.89 0.71, 1.10 0.3
20-29 0.84 0.69, 1.03 0.093
30-49 0.85 0.68, 1.07 0.2
50-69 1.15 0.70, 1.91 0.6
70+ 0.70 0.19, 2.55 0.6

1 OR = Odds Ratio, CI = Confidence Interval

There are many modifications you can make to this table output, such as automatically bolding rows by their p-value, etc. See tutorials here and elsewhere online.

base R

To produce odds ratios via logistic regression, we use the glm() function from the stats package (part of base R). GLM is an acronym for Generalized Linear Model. Unlike gtsummary, you would need to run this multiple times (e.g. within a loop) to run univariate regression on multiple variables. An example of this is given below.

Univariate glm() on one variable

The model is provided to glm() as an equation, with the outcome on the left and explanatory variables on the right of a tilde ~. In this example we are assessing the association between different age categories and the outcome of death (now coded as 1, see Preparation section).

The dataset is specified to the data = argument.

The family = is set to “binomial” to indicate logistic regression. Options for family = are below. If necessary, you can also specify the link function in the syntax glm(formula, family=familytype(link=linkfunction), data=).

Family Default link function
"binomial" (link = "logit")
"gaussian" (link = "identity")
"Gamma" (link = "inverse")
"inverse.gaussian" (link = "1/mu^2")
"poisson" (link = "log")
"quasi" (link = "identity", variance = "constant")
"quasibinomial" (link = "logit")
"quasipoisson" (link = "log")

Type ?glm for other options.
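For example, a logistic model with the logit link written out explicitly is equivalent to the family = "binomial" default used elsewhere on this page:

```r
## identical to family = "binomial"; the default link is simply made explicit
glm(outcome ~ age_cat, family = binomial(link = "logit"), data = linelist)
```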

Below is how the output of glm() prints to the console - the log odds. The baseline reference category is the first factor level of age_cat.

glm(outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Call:  glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Coefficients:
##      (Intercept)        age_cat5-9      age_cat10-14      age_cat15-19      age_cat20-29      age_cat30-49      age_cat50-69  
##  0.3504829736908   0.1153452841444  -0.1350344585984  -0.1200783981396  -0.1704471321777  -0.1575793075663   0.1378697942231  
##       age_cat70+  
## -0.3504829736908  
## 
## Degrees of Freedom: 4165 Total (i.e. Null);  4158 Residual
## Null Deviance:       5692.60008553 
## Residual Deviance: 5680.329591051    AIC: 5696.329591051

For a single exposure variable, use tidy() from the broom package to get the exponentiated estimates (odds ratios) and confidence intervals. Here we demonstrate how to combine model outputs with a table of counts.

model <- glm(
  ## define the variables of interest
  outcome ~ age_cat, 
  ## define the type of regression (logistic)
  family = "binomial", 
  ## define your dataset
  data = linelist) %>% 
  ## clean up the outputs of the regression (exponentiate and produce CIs)
  tidy(
      exponentiate = TRUE, 
      conf.int = TRUE)

Below is the output of model (tidied via broom):

We can combine these model results with a table of linelist counts. Below, we create the counts table by operating dplyr functions on the linelist used for the model:

  • Group rows by outcome, and get counts by age category
  • Pivot wider so the columns are age_cat, 0, and 1
  • Remove row for NA age_cat, if applicable, to align with the model results

counts_table <- linelist %>% 
  ## get counts of variable of interest grouped by outcome
  group_by(outcome) %>% 
  count(age_cat) %>% 
  ## spread to wide format (as in cross-tabulation)
  pivot_wider(names_from = outcome, values_from = n) %>% 
  ## drop rows with missings
  filter(!is.na(age_cat))

Now we can bind the two tables together horizontally with bind_cols(). In this case, the . represents the piped object counts_table and we bind it to model. Then we use select() to pick the columns to keep and their order.

combined <- counts_table %>% 
  ## merge with the outputs of the regression 
  bind_cols(., model) %>% 
  ## only keep columns interested in 
  select(term, 2:3, estimate, conf.low, conf.high, p.value)

Looping multiple univariate models

To run over several exposure variables to produce univariate odds ratios (i.e. not controlling for each other), you can pass a vector of variable names to the map() function from the purrr package. This will loop over the variables, running a regression for each one. See the page on Iteration and loops for tips.

Below we do the following:

  • Create the glm() equation and pass it to map() from purrr as the .f (formula) argument (see the page on Iteration and loops)
  • Each of the resulting model outputs is passed sequentially to tidy() to exponentiate the log odds and CIs
  • The output (a list of model results data frames) is passed to bind_rows(), which combines all of the data frames into one

models <- explanatory_vars %>% 
  ## combine each name of the variables of interest with the name of outcome variable
  str_c("outcome ~ ", .) %>% 
  ## for each string above ("outcome ~ variable of interest")
  map(
    ## run a general linear model 
    ~glm(
      ## define formula as each of the strings above
      as.formula(.x), 
      ## define type of glm (logistic)
      family = "binomial", 
      ## define your dataset
      data = linelist)
  ) %>% 
  ## for each of the output regressions from above 
  map(
    ## tidy the output
    ~tidy(
      ## each of the regressions 
      .x, 
      ## exponentiate and produce CIs
      exponentiate = TRUE, 
      conf.int = TRUE)
  ) %>% 
  ## collapse the list of regressions outputs in to one data frame
  bind_rows()

This time, the end object models is longer because it now represents the combined results of several univariate regressions. Click through to see all the rows of models.

As before, we can create a counts table from the linelist, bind it to models, and make a nice table. See the page on Tables for ideas on how to convert this table to an HTML output.

## for each explanatory variable
univ_tab_base <- map(explanatory_vars, 
      ~{linelist %>% 
          ## group data set by outcome
          group_by(outcome) %>% 
          ## produce counts for variable of interest
          count(.data[[.x]]) %>% 
          ## spread to wide format (as in cross-tabulation)
          pivot_wider(names_from = outcome, values_from = n) %>% 
          ## drop rows with missings
          filter(!is.na(.data[[.x]])) %>% 
          ## change the variable of interest column to be called "variable"
          rename("variable" = .x) %>% 
          ## change the variable of interest column to be a character 
          ## otherwise non-dichotomous (categorical) variables come out as factors and can't be merged
          mutate(variable = as.character(variable))
                 }
      ) %>% 
  ## collapse the list of count outputs in to one data frame
  bind_rows() %>% 
  ## merge with the outputs of the regression 
  bind_cols(., models) %>% 
  ## only keep columns interested in 
  select(term, 2:3, estimate, conf.low, conf.high, p.value)

Stratified

Stratified analysis is currently still being worked on for gtsummary; this page will be updated in due course.

gtsummary package

TODO

base R

TODO

Multivariate

For multivariate analysis, there is not much difference between using gtsummary and using glm() with broom to present the data. The workflow is the same for both, as shown below; only the last step of pulling a table together is different.

Conduct multivariate

Use glm() but add more variables to the right side of the equation, separated by plus symbols (+). To run the model with all of our explanatory variables we would run:

mv_reg <- glm(outcome ~ gender + fever + chills + cough + aches + vomit + age_cat, family = "binomial", data = linelist)

mv_reg
## 
## Call:  glm(formula = outcome ~ gender + fever + chills + cough + aches + 
##     vomit + age_cat, family = "binomial", data = linelist)
## 
## Coefficients:
##         (Intercept)               gender                fever               chills                cough                aches  
##  2.849372217684e-01   3.742865638451e-02  -5.370896183542e-02   6.619053917735e-05   8.773254239124e-02   9.387777261175e-02  
##               vomit           age_cat5-9         age_cat10-14         age_cat15-19         age_cat20-29         age_cat30-49  
##  1.519181379066e-02   1.198091255806e-01  -1.338779520252e-01  -1.229281717264e-01  -1.750478448012e-01  -1.699321807023e-01  
##        age_cat50-69           age_cat70+  
##  1.150108017087e-01  -3.738310414287e-01  
## 
## Degrees of Freedom: 4165 Total (i.e. Null);  4152 Residual
## Null Deviance:       5692.60008553 
## Residual Deviance: 5677.760715687    AIC: 5705.760715687

Optionally, you can leverage the pre-defined vector of column names and re-create the above command using str_c() as shown below. This might be useful if your explanatory variable names are changing, or you don’t want to type them all out again.

## run a regression with all variables of interest 
mv_reg <- explanatory_vars %>% 
  ## combine all names of the variables of interest separated by a plus
  str_c(collapse = "+") %>% 
  ## combine the names of variables of interest with outcome in formula style
  str_c("outcome ~ ", .) %>% 
  glm(## define type of glm (logistic)
      family = "binomial", 
      ## define your dataset
      data = linelist) 

Note the class of the saved model.

class(mv_reg)
## [1] "glm" "lm"

Finally, you can take the model object and apply the step() function from the stats package, to specify which variable selection direction you want to use when building the model.

## choose a model using forward selection based on AIC
## you can also do "backward" or "both" by adjusting the direction
final_mv_reg <- mv_reg %>%
  step(direction = "forward", trace = FALSE)

You can also turn off scientific notation, for clarity:

options(scipen=999)

Pass to tidy() as described above to exponentiate the log odds and CIs. Scroll through to see all the rows.

mv_tab_base <- final_mv_reg %>% 
  ## get a tidy dataframe of estimates 
  broom::tidy(exponentiate = TRUE, conf.int = TRUE)

Interaction terms

If you want to specify an interaction between two variables in glm(), separate the variables with a colon :. As shorthand, you can separate them with an asterisk * to include both variables and their interaction:

glm(outcome ~ gender + age_cat * fever, family = "binomial", data = linelist)
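Written out in full, the asterisk shorthand above expands to the two main effects plus their interaction:

```r
## equivalent model with the interaction term specified explicitly via the colon
glm(outcome ~ gender + age_cat + fever + age_cat:fever, family = "binomial", data = linelist)
```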

Combine univariate and multivariate

Combine with gtsummary

The gtsummary package provides the tbl_regression function, which will take the outputs from a regression (glm in this case) and produce an easy summary table. You can also combine several different output tables produced by gtsummary with the tbl_merge function.

## show results table of final regression 
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)

## combine with univariate results 
tbl_merge(
  tbls = list(univ_tab, mv_tab), 
  tab_spanner = c("**Univariate**", "**Multivariable**"))
Characteristic Univariate Multivariable
N OR1 95% CI1 p-value OR1 95% CI1 p-value
gender 4,166 1.00 0.89, 1.13 >0.9 1.04 0.91, 1.18 0.6
fever 4,166 0.94 0.81, 1.10 0.5 0.95 0.81, 1.10 0.5
chills 4,166 1.00 0.86, 1.16 >0.9 1.00 0.86, 1.17 >0.9
cough 4,166 1.09 0.91, 1.29 0.4 1.09 0.92, 1.30 0.3
aches 4,166 1.09 0.89, 1.33 0.4 1.10 0.90, 1.35 0.4
vomit 4,166 1.02 0.90, 1.15 0.7 1.02 0.90, 1.15 0.8
age_cat 4,166
0-4
5-9 1.12 0.92, 1.37 0.3 1.13 0.92, 1.38 0.2
10-14 0.87 0.71, 1.08 0.2 0.87 0.71, 1.08 0.2
15-19 0.89 0.71, 1.10 0.3 0.88 0.71, 1.10 0.3
20-29 0.84 0.69, 1.03 0.093 0.84 0.69, 1.03 0.087
30-49 0.85 0.68, 1.07 0.2 0.84 0.67, 1.06 0.15
50-69 1.15 0.70, 1.91 0.6 1.12 0.68, 1.88 0.7
70+ 0.70 0.19, 2.55 0.6 0.69 0.19, 2.50 0.6

1 OR = Odds Ratio, CI = Confidence Interval

Combine with dplyr

To combine the glm()/tidy() univariate and multivariate outputs, you can also do the following with dplyr join functions.

  • Join the univariate results from earlier (which contains counts) with the tidied multivariate results
  • Use select() to keep only the columns we want, specify their order, and re-name them
  • Use round() with two decimal places on all the columns that are class double
## combine univariate and multivariable tables 
left_join(univ_tab_base, mv_tab_base, by = "term") %>% 
  ## choose columns and rename them
  select( # new name =  old name
    "characteristic" = term, 
    "recovered"      = "0", 
    "dead"           = "1", 
    "univ_or"        = estimate.x, 
    "univ_ci_low"    = conf.low.x, 
    "univ_ci_high"   = conf.high.x,
    "univ_pval"      = p.value.x, 
    "mv_or"          = estimate.y, 
    "mvv_ci_low"     = conf.low.y, 
    "mv_ci_high"     = conf.high.y,
    "mv_pval"        = p.value.y 
  ) %>% 
  mutate(across(where(is.double), ~round(.x, 2)))
## # A tibble: 20 x 11
##    characteristic recovered  dead univ_or univ_ci_low univ_ci_high univ_pval mv_or mvv_ci_low mv_ci_high mv_pval
##    <chr>              <int> <int>   <dbl>       <dbl>        <dbl>     <dbl> <dbl>      <dbl>      <dbl>   <dbl>
##  1 (Intercept)          902  1195    1.32        1.22         1.44     0      1.33       1.03       1.72   0.03 
##  2 gender               888  1181    1           0.89         1.13     0.95   1.04       0.91       1.18   0.570
##  3 (Intercept)          352   489    1.39        1.21         1.59     0      1.33       1.03       1.72   0.03 
##  4 fever               1438  1887    0.94        0.81         1.1      0.47   0.95       0.81       1.1    0.49 
##  5 (Intercept)         1436  1907    1.33        1.24         1.42     0      1.33       1.03       1.72   0.03 
##  6 chills               354   469    1           0.86         1.16     0.98   1          0.86       1.17   1    
##  7 (Intercept)          269   333    1.24        1.05         1.45     0.01   1.33       1.03       1.72   0.03 
##  8 cough               1521  2043    1.09        0.91         1.29     0.36   1.09       0.92       1.3    0.32 
##  9 (Intercept)         1615  2126    1.32        1.23         1.4      0      1.33       1.03       1.72   0.03 
## 10 aches                175   250    1.09        0.89         1.33     0.43   1.1        0.9        1.35   0.37 
## 11 (Intercept)          900  1182    1.31        1.2          1.43     0      1.33       1.03       1.72   0.03 
## 12 vomit                890  1194    1.02        0.9          1.15     0.73   1.02       0.9        1.15   0.81 
## 13 (Intercept)          324   460    1.42        1.23         1.64     0      1.33       1.03       1.72   0.03 
## 14 age_cat5-9           300   478    1.12        0.92         1.37     0.26   1.13       0.92       1.38   0.25 
## 15 age_cat10-14         287   356    0.87        0.71         1.08     0.21   0.87       0.71       1.08   0.21 
## 16 age_cat15-19         247   311    0.89        0.71         1.1      0.28   0.88       0.71       1.1    0.27 
## 17 age_cat20-29         365   437    0.84        0.69         1.03     0.09   0.84       0.69       1.03   0.09 
## 18 age_cat30-49         235   285    0.85        0.68         1.07     0.17   0.84       0.67       1.06   0.15 
## 19 age_cat50-69          27    44    1.15        0.7          1.91     0.59   1.12       0.68       1.88   0.66 
## 20 age_cat70+             5     5    0.7         0.19         2.55     0.580  0.69       0.19       2.5    0.56

Forest plot

This section shows how to produce a plot with the outputs of your regression. There are two options: you can build the plot yourself using ggplot2, or use the easystats family of packages.

ggplot2 package

You can build a forest plot with ggplot() by plotting elements of the multivariate regression results. Add the layers:

  • estimates with geom_point()
  • confidence intervals with geom_errorbar()
  • a vertical line at OR = 1 with geom_vline()

You may want to re-arrange the order of the variables/levels on the y-axis (see how the order of age_cat levels is alphabetical rather than in a meaningful order). To do this, use fct_relevel() from the forcats package to classify the column term as a factor and specify the order manually. See the page on Factors for more details.

## remove the intercept term from your multivariable results
mv_tab_base %>% 
  filter(term != "(Intercept)") %>% 
  ## plot with variable on the y axis and estimate (OR) on the x axis
  ggplot(aes(x = estimate, y = term)) +
  ## show the estimate as a point
  geom_point() + 
  ## add in an error bar for the confidence intervals
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) + 
  ## show where OR = 1 is for reference as a dashed line
  geom_vline(xintercept = 1, linetype = "dashed")
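Building on the fct_relevel() suggestion above, a reordering step can be added before plotting. The level names below are illustrative and should be checked against your own output (e.g. with unique(mv_tab_base$term)):

```r
## order the y-axis manually so age categories appear in a sensible order
mv_tab_base %>% 
  filter(term != "(Intercept)") %>% 
  mutate(term = fct_relevel(
    term,
    "gender", "fever", "chills", "cough", "aches", "vomit",
    "age_cat5-9", "age_cat10-14", "age_cat15-19", "age_cat20-29",
    "age_cat30-49", "age_cat50-69", "age_cat70+")) %>% 
  ggplot(aes(x = estimate, y = fct_rev(term))) +  # fct_rev() puts the first level at the top
  geom_point() + 
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) + 
  geom_vline(xintercept = 1, linetype = "dashed")
```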

easystats packages

If you do not want to decide all of the different elements required for a ggplot, an alternative is to use a combination of easystats packages. In this case the parameters package function model_parameters() does the equivalent of the broom package function tidy(). The see package then accepts those outputs and creates a default forest plot as a ggplot object.

## pass the final multivariable model to model_parameters() and plot
final_mv_reg %>% 
  model_parameters(exponentiate = TRUE) %>% 
  plot()

Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary

sthda stepwise regression

Standardization

This page will show you two ways to standardize an outcome, such as hospitalizations or mortality, by characteristics such as age and sex.

  • Using dsr package
  • Using PHEindicatormethods package

We begin by extensively demonstrating the processes of data preparation/cleaning/joining, as this is common when combining population data from multiple countries, standard population data, deaths, etc.

Overview

There are two main ways to standardize: direct and indirect standardization. Let’s say we would like to standardize mortality by age and sex for country A and country B, and compare the standardized rates between these countries.

  • For direct standardization, you will have to know the number of the at-risk population and the number of deaths for each stratum of age and sex, for country A and country B. One stratum in our example could be females between ages 15-44.
  • For indirect standardization, you only need to know the total number of deaths and the age and sex structure of each country. This option is therefore feasible if age- and sex-specific mortality rates or population numbers are not available. Indirect standardization is also preferable when there are small numbers per stratum, as estimates from direct standardization would be influenced by substantial sampling variation.
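Direct standardization amounts to a weighted average: apply each stratum's observed rate to the standard population, then divide by the standard population total. A minimal sketch with made-up counts for two strata:

```r
## toy example of direct standardization (made-up numbers, two strata)
deaths  <- c(50, 200)          # observed deaths per stratum
pop     <- c(10000, 20000)     # at-risk population per stratum
std_pop <- c(30000, 15000)     # standard (reference) population per stratum

stratum_rates <- deaths / pop                       # stratum-specific rates
dsr <- sum(stratum_rates * std_pop) / sum(std_pop)  # directly standardized rate
dsr * 100000                                        # about 667 per 100,000
```

The dsr and PHEindicatormethods packages demonstrated below perform this calculation for you, and add confidence intervals.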

Preparation

To show how standardization is done, we will use fictitious population counts and death counts from country A and country B, by age (in 5 year categories) and sex (female, male). To make the datasets ready for use, we will perform the following preparation steps:

  1. Load packages
  2. Load datasets
  3. Join the population and death data from the two countries
  4. Pivot longer so there is one row per age-sex stratum
  5. Clean the reference population (world standard population) and join it to the country data

In your scenario, your data may come in a different format. Perhaps your data are by province, city, or other catchment area. You may have one row for each death and information on age and sex for each (or a significant proportion) of these deaths. In this case, see the pages on Grouping data and Pivoting data to create a dataset with event and population counts per age-sex stratum.

We also need a reference population, the standard population. There are several standard populations available; for this exercise we will use the world_standard_population_by_sex. The World Standard Population is based on the populations of 46 countries and was developed in 1960. There are many “standard” populations - as one example, the NHS Scotland website is quite informative on the European Standard Population, World Standard Population, and Scotland Standard Population.

Load packages

Load the packages required for this analysis:

pacman::p_load(
     rio,         # to import data
     here,        # to locate files
     tidyverse,   # to clean, handle, and plot the data (includes ggplot2 package)
     stringr,     # cleaning characters and strings
     frailtypack, # needed for dsr, for frailty models
     dsr,  
     PHEindicatormethods)

CAUTION: If you have a newer version of R, the dsr package cannot be downloaded directly, as it has been archived. However, it is still available from the CRAN archive, and you can install it from there as shown below.

For non-Mac users:

require(Rtools)
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
# Other solution that may work
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="http://cran.us.r-project.org")

For Mac users:

require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="https://mac.R-project.org")

Load population data

First we load the demographic data (counts of males and females by 5-year age category) for the two countries that we will be comparing, “Country A” and “Country B”.

# Country A
countryA_demo <- import("country_demographics.csv")
# Country B
countryB_demo <- import("country_demographics_2.csv")

Load death counts

Conveniently, we also have the counts of deaths during the time period of interest, by age and sex. Each country’s counts are in a separate file, shown below.

Deaths in Country A

Deaths in Country B

Clean populations and deaths

We need to join and transform these data in the following ways:

  • Combine country populations into one dataset and pivot “long” so that each age-sex stratum is one row
  • Combine country death counts into one dataset and pivot “long” so each age-sex stratum is one row
  • Join the deaths to the populations

First, we combine the country populations datasets, pivot longer, and do minor cleaning. See the page on Pivoting data for more detail.

pop_countries <- countryA_demo %>%  # begin with country A dataset
     bind_rows(countryB_demo) %>%        # bind rows, because cols are identically named
     pivot_longer(                       # pivot longer
          cols = c(m, f),                   # column to transform into one
          names_to = "Sex",                 # name for new column containing the category ("m" or "f") 
          values_to = "Population") %>%     # name for new column containing the numeric values pivoted
     mutate(Sex = recode(                # recode values for clarity
          Sex,
          "m" = "Male",
          "f" = "Female"))

The population data now look like this:

And now we perform similar operations on the two deaths datasets.

deaths_countries <- A_deaths %>%    # begin with country A deaths dataset
     bind_rows(B_deaths) %>%        # bind rows with B dataset, because cols are identically named
     pivot_longer(                  # pivot longer
          cols = c(Male, Female),        # column to transform into one
          names_to = "Sex",              # name for new column containing the category ("m" or "f") 
          values_to = "Deaths") %>%      # name for new column containing the numeric values pivoted
     rename(age_cat5 = AgeCat)    # rename for clarity

The deaths data now look like this, and contains data from both countries:

We now join the deaths and population data based on common columns age_cat5, Sex, and Country.

country_data <- pop_countries %>% 
     left_join(deaths_countries, by = c("Country", "Sex", "age_cat5"))

We can now classify Sex, age_cat5, and Country as factors so their ordering is specified correctly. We use the as_factor() function from the forcats package, as described in the page on Factors. Note, classifying the factor levels doesn’t visibly change the data, but the arrange() command does sort it by Country, age category, and sex.

country_data <- country_data %>% 
     mutate(
          Country = as_factor(Country),
          Country = fct_relevel(Country, "A", "B"),
          
          Sex = as_factor(Sex),
          Sex = fct_relevel(Sex, "Male", "Female"),
          
          age_cat5 = as_factor(age_cat5),
          age_cat5 = fct_relevel(age_cat5,
                                 "0-4", "5-9", "10-14", "15-19",
                                 "20-24", "25-29",  "30-34", "35-39",
                                 "40-44", "45-49", "50-54", "55-59",
                                 "60-64", "65-69", "70-74",
                                 "75-79", "80-84", "85")) %>% 
          arrange(Country, age_cat5, Sex)

CAUTION: If you have few deaths per stratum, use 10- or 15-year age categories instead of 5-year categories, or combine categories

Load reference population

Lastly, we import the reference population (world “standard population” by sex)

# Reference population
standard_pop_data <- import("world_standard_population_by_sex.csv")

Clean reference population

The values of the column age_cat5 from the standard_pop_data contain the word “years” and “plus”, while those of the country_data do not. We will have to remove this string to make the age category values match. We use str_replace_all() from the stringr package, as described in the page on Characters and strings.

Furthermore, the package dsr expects that in the standard population, the column containing counts will be called "pop". So we rename that column accordingly.

# Remove specific string from column values
standard_pop_clean <- standard_pop_data %>%
     mutate(
          age_cat5 = str_replace_all(age_cat5, "years", ""),   # remove "years"
          age_cat5 = str_replace_all(age_cat5, "plus", ""),    # remove "plus"
          age_cat5 = str_replace_all(age_cat5, " ", "")) %>%   # remove " " space
     
     rename(pop = WorldStandardPopulation)   # change col name to "pop", as this is expected by dsr package

CAUTION: If you try to use str_replace_all() to remove a plus symbol, it won’t work because it is a special symbol. “Escape” the specialness by putting two backslashes in front, as in str_replace_all(column, "\\+", "").

Finally, the package PHEindicatormethods, detailed below, expects the standard populations joined to the country event and population counts. So, we will create a dataset all_data for that purpose.

all_data <- left_join(country_data, standard_pop_clean, by=c("age_cat5", "Sex"))

This complete dataset looks like this:

dsr package

Below we demonstrate calculating and comparing directly standardized rates using the dsr package. Note that dsr handles directly standardized rates only (no indirectly standardized rates!).

In the data Preparation section, we made separate datasets for country counts and standard population:

  1. the country_data object, which is a population table with the number of population and number of deaths per stratum per country
  2. the standard_pop_clean object, containing the number of population per stratum for our reference population, the World Standard Population

We will use these separate datasets for the dsr approach.

Standardized rates

Below, we calculate rates per country directly standardized for age and sex. We use the dsr() function.

Of note - dsr() expects one data frame for the country populations and event counts (deaths), and a separate data frame with the reference population. It also expects that in this reference population dataset the unit-time column is named “pop” (we ensured this in the data Preparation section).

There are many arguments, as annotated in the code below. Notably, event = is set to the column Deaths, and the fu = (“follow-up”) is set to the Population column. We set the subgroups of comparison as the column Country and we standardize based on age_cat5 and Sex. These last two columns are not assigned a particular named argument. See ?dsr for details.

# Calculate rates per country directly standardized for age and sex
mortality_rate <- dsr::dsr(
     data = country_data,  # specify object containing number of deaths per stratum
     event = Deaths,       # column containing number of deaths per stratum 
     fu = Population,      # column containing number of population per stratum
     subgroup = Country,   # units we would like to compare
     age_cat5,             # other columns - rates will be standardized by these
     Sex,
     refdata = standard_pop_clean, # reference population data frame, with column called pop
     method = "gamma",      # method to calculate 95% CI
     sig = 0.95,            # significance level
     mp = 100000,           # we want rates per 100,000 population
     decimals = 2)          # number of decimals


# Print output as nice-looking HTML table
knitr::kable(mortality_rate) # show mortality rate before and after direct standardization
Subgroup Numerator Denominator Crude Rate (per 100000) 95% LCL (Crude) 95% UCL (Crude) Std Rate (per 100000) 95% LCL (Std) 95% UCL (Std)
A 11344 86790567 13.07 12.83 13.31 23.57 23.08 24.06
B 9955 52898281 18.82 18.45 19.19 19.33 18.46 20.22

Above, we see that while country A had a lower crude mortality rate than country B, it has a higher standardized rate after direct age and sex standardization.

Standardized rate ratios

# Calculate RR
mortality_rr <- dsr::dsrr(
     data = country_data, # specify object containing number of deaths per stratum
     event = Deaths,      # column containing number of deaths per stratum 
     fu = Population,     # column containing number of population per stratum
     subgroup = Country,  # units we would like to compare
     age_cat5,
     Sex,                 # characteristics to which we would like to standardize 
     refdata = standard_pop_clean, # reference population, with numbers in column called pop
     refgroup = "B",      # reference for comparison
     estimate = "ratio",  # type of estimate
     sig = 0.95,          # significance level
     mp = 100000,         # we want rates per 100,000 population
     decimals = 2)        # number of decimals

# Print table
knitr::kable(mortality_rr) 
Comparator Reference Std Rate (per 100000) Rate Ratio (RR) 95% LCL (RR) 95% UCL (RR)
A B 23.57 1.22 1.17 1.27
B B 19.33 1.00 0.94 1.06

The standardized mortality rate is 1.22 times higher in country A compared to country B (95% CI 1.17-1.27).

Standardized rate difference

# Calculate RD
mortality_rd <- dsr::dsrr(
     data = country_data,       # specify object containing number of deaths per stratum
     event = Deaths,            # column containing number of deaths per stratum 
     fu = Population,           # column containing number of population per stratum
     subgroup = Country,        # units we would like to compare
     age_cat5,                  # characteristics to which we would like to standardize
     Sex,                        
     refdata = standard_pop_clean, # reference population, with numbers in column called pop
     refgroup = "B",            # reference for comparison
     estimate = "difference",   # type of estimate
     sig = 0.95,                # significance level
     mp = 100000,               # we want rates per 100,000 population
     decimals = 2)              # number of decimals

# Print table
knitr::kable(mortality_rd) 
Comparator Reference Std Rate (per 100000) Rate Difference (RD) 95% LCL (RD) 95% UCL (RD)
A B 23.57 4.24 3.24 5.24
B B 19.33 0.00 -1.24 1.24

Country A has 4.24 additional deaths per 100,000 population (95% CI 3.24-5.24) compared to country B.

PHEindicatormethods package

Another way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates.

This package expects the reference (standard) population and the country-specific mortality and population data in one data frame, which we have made earlier: all_data.

Directly standardized rates

Below, we first group the data by Country and then pass it to the function phe_dsr() to get directly standardized rates.
See the help with ?phe_dsr or the links in the References section for more information.

# Calculate rates per country directly standardized for age and sex
mortality_rate_phe <- all_data %>%
     group_by(Country) %>%
     PHEindicatormethods::phe_dsr(
          x = Deaths,                 # column with observed number of events
          n = Population,             # Column with non-standard pops for each category
          stdpop = pop,               # standard populations for each stratum
          stdpoptype = "field")       # either "vector" for a standalone vector or "field" meaning std populations are in the data  

# Print table
knitr::kable(mortality_rate_phe)
Country total_count total_pop value lowercl uppercl confidence statistic method
A 11344 86790567 23.56685849327109 23.08106689236966 24.05943770431395 95% dsr per 100000 Dobson
B 9955 52898281 19.32549423719546 18.45515902513744 20.20881745377039 95% dsr per 100000 Dobson

Resources

TIP: If you would like to see another reproducible example beyond those in this Handbook, please go to https://mran.microsoft.com/snapshot/2020-02-12/web/packages/dsr/vignettes/dsr.html.

PHEindicatormethods reference file: https://cran.r-project.org/web/packages/PHEindicatormethods/PHEindicatormethods.pdf

Moving averages

This page will cover two methods to calculate and visualize moving averages:

  1. Calculate with the slider package
  2. Calculate within a ggplot() command with the tidyquant package

Preparation

Load packages

pacman::p_load(
  tidyverse,      # for data management and viz
  slider,         # for calculating moving averages
  tidyquant       # for calculating moving averages within ggplot
)

Calculate with slider

Use this approach to calculate a moving average in a data frame prior to plotting.

The slider package provides several “sliding window” functions to compute rolling averages, cumulative sums, rolling regressions, etc. It treats a dataframe as a vector of rows, allowing iteration row-wise over a dataframe.

Here are some of the common functions:

  • slide_dbl() - iterates through a numeric column performing an operation using a sliding window
    • slide_sum() - rolling sum shortcut
    • slide_mean() - rolling average shortcut
  • slide_index_dbl() - applies the rolling window using a separate index column (useful if using dates or there are missing rows)

Core arguments

  • .x, the first argument by default, is the vector to iterate over and to apply the function to
  • .f, the second argument by default, either:
    • A function, written without parentheses, like mean, or
    • A formula, which will be converted into a function. For example ~ .x - mean(.x) will return the result of the current value minus the mean of the window’s value
  • For more details see this reference material

Window size

Specify the size of the window by using either .before, .after, or both arguments:

  • .before = - Provide an integer
  • .after = - Provide an integer
  • .complete = - Set this to TRUE if you only want calculation performed on complete windows

For example, to achieve a 7-day window including the current value and the six previous, use .before = 6. To achieve a “centered” window provide the same number to both .before = and .after =.
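To see how these arguments behave, here is a quick sketch on a plain numeric vector:

```r
## 3-value trailing window: the current value plus the 2 before it
slider::slide_dbl(c(2, 4, 6, 8, 10), mean, .before = 2)
## returns 2, 3, 4, 6, 8 (partial windows at the start)

slider::slide_dbl(c(2, 4, 6, 8, 10), mean, .before = 2, .complete = TRUE)
## returns NA, NA, 4, 6, 8 (NA until the window is complete)

## 3-value centered window
slider::slide_dbl(c(2, 4, 6, 8, 10), mean, .before = 1, .after = 1)
```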

Expanding window

To achieve cumulative operations, set the .before = argument to Inf. This conducts the operation on the current value and all values that come before it.
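For example, a cumulative mean on a plain numeric vector:

```r
## expanding window: mean of the current value and everything before it
slider::slide_dbl(c(2, 4, 6, 8, 10), mean, .before = Inf)
## returns 2, 3, 4, 5, 6
```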

Rolling operations

Use slide_dbl(), which is made specifically to slide across a numeric vector.

To calculate the rolling mean of delay from onset to hospital admission (days_onset_hosp column):

rolled <- linelist %>%
  
  select(                             # select only some columns, for visibility
    case_id,                     
    date_onset,
    days_onset_hosp) %>% 
  
  mutate(
    delay_roll = slider::slide_dbl(   # define column delay_roll 
      .x        = days_onset_hosp,    # apply function to delays column
      .f        = mean,               # use mean()
      .before   = 2,                  # use value and 2 previous values
      .complete = TRUE))              # only use windows that have all three values present

Grouped data

You will see differences if your data are grouped and you have set .complete = TRUE. With that argument set to TRUE, averages are only calculated for windows with no missing values. Normally (on un-grouped data) this means the first rows of the data frame do not return a value until the window is complete. With grouped data, however, each change between groups acts as a “hard” barrier as the window moves down the data frame, so these empty rows at the “beginning” re-occur at the start of every group. Ordering the data frame correctly (and verifying it by eye!) is therefore important.

See handbook page on Grouping data for details on grouping data.

grouped_roll <- linelist %>%
  
  select(                             # select only some columns, for visibility
    case_id,                     
    date_onset,
    days_onset_hosp) %>%  
  
  group_by(                           # group by month of onset 
    month = lubridate::month(date_onset)) %>%
  
  mutate(                             # rolling mean, as before  
    delay_roll_month = slider::slide_dbl(
      .x = days_onset_hosp,
      .f = mean,
      .before = 2,
      .complete = TRUE
      )
    )

You see the difference below from above when the month of June begins:

DANGER: If you get an error saying “slide() was deprecated in tsibble 0.9.0 and is now defunct. Please use slider::slide() instead.”, it means that the slide() function from the tsibble package is masking the slide() function from slider package. Fix this by specifying the package in the command, such as slider::slide_dbl().

Indexed rolling

Often when conducting rolling operations by date (common with epidemiological linelists), we can encounter problems like:

  • Dates are missing from the dataframe, but should be included in a window

To solve this, use slide_index() from slider. It uses a separate column as an index for the rolling window. If this column is a date, it will know which dates are not present in the data and account for them. Below is an example that returns a 7-day rolling average of new cases reported per day:

  • First we count the number of cases reported each day with count() from dplyr (see page on Grouping data).
# make dataset of daily counts and 7-day moving average
counts_7day <- linelist %>% 
  
  # get counts
  count(
    date_onset,        # count rows per unique onset_date
    name = "new_cases" # name of new column
    ) %>%
  
  # remove counts with missing onset_date
  filter(!is.na(date_onset))

The new dataset now looks like this. Note how some days are not present (no cases reported on those days). A simple slide_dbl() with .before = 6 would incorrectly treat the first seven rows as the first 7-day window, even though those rows may span more than seven calendar days.

We use the function slide_index() specifically because we recognize that there are missing days in the above dataframe, and they must be accounted for when creating windows of time. We set our “index” column (.i argument) as the column date_onset. Because date_onset is a column of class Date, the function accounts for the days that do not appear in the dataframe. For the arguments .before and .after we can use integers, or use lubridate functions like days() and months().

## calculate 7-day rolling average, accounting for missing days
rolling <- counts_7day %>% 
  mutate(
    avg_7day = slider::slide_index_dbl(  # create new column
        new_cases,                       # calculate avg based on value in new_cases column
        .i = date_onset,                 # index column is date_onset, so non-present dates are included in 7day window 
        .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
        .before = days(6),               # window is the day and 6-days before
        .complete = TRUE))               # fills in first days with NA

You can see below that the time windows account for days that do not appear in the data.

We can now plot the linelist, with the 7-day moving average overlaid. If needed, see the page on ggplot tips.

ggplot(data = rolling, aes(x = date_onset))+
  geom_histogram(         # plot histogram of daily cases
    aes(y = new_cases),
    fill   ="#92a8d1",    # bar color
    stat   = "identity",  # height = value
    colour = "#92a8d1")+  # color around bars
  geom_line(              # overlay line
    aes(y = avg_7day),    # use 7-day average column
    color="red",         
    size = 1) +           # line thickness  
  scale_x_date(           # x-axis by months
    date_breaks = "1 month",
    date_labels = '%d/%m',
    expand = c(0,0)) +
  scale_y_continuous(
    expand = c(0,0),
    limits = c(0, NA)) + 
  labs(
    x="",
    y ="Number of confirmed cases")+ 
  theme_minimal() 

If you want a rolling average by month, you can use lubridate to group the data by month, and then apply slide_index_dbl() as shown below for a three-month rolling average:

ll_months <- linelist %>%
  mutate(
    month_onset = floor_date(date_onset, "month")) %>% 
  count(month_onset) %>% 
  filter(!is.na(month_onset)) %>% 
  mutate(
    monthly_roll = slider::slide_index_dbl(
      n,                                # calculate avg based on values in the n (count) column
      .i = month_onset,                 # index column is month_onset, so non-present months are accounted for
      .f = ~mean(.x, na.rm = TRUE),     # function is mean() with missing values removed
      .before = months(2),              # window is the month and the 2 months before
      .complete = TRUE))                # fills in first months with NA

Calculate with tidyquant within ggplot()

The package tidyquant offers another approach to calculating moving averages - this time from within a ggplot() command itself.

Below the linelist data are counted by date of onset, and this is plotted as a faded line (alpha < 1). Overlaid on top is a line created with geom_ma(), with a window of 7 days (n = 7) with specified color and thickness.

By default geom_ma() uses a simple moving average (ma_fun = "SMA"), but other types can be specified, such as:

  • “EMA” - exponential moving average (more weight to recent observations)
  • “WMA” - weighted moving average (wts are used to weight observations in the moving average)
  • Others can be found in the function documentation
linelist %>% 
  count(date_onset) %>%                 # count cases per day
  filter(!is.na(date_onset)) %>%        # remove cases missing onset date
  ggplot(aes(x = date_onset, y = n))+   # start ggplot
    geom_line(                          # plot raw values
      size = 1,
      alpha = 0.2                       # semi-transparent line
      )+             
    tidyquant::geom_ma(                 # plot moving average
      n = 7,           
      size = 1,
      color = "blue")+ 
  theme_minimal()                       # simple background

See this vignette for more details on the options available within tidyquant.

Resources

See the helpful online vignette for the slider package

The slider github page

A slider vignette

tidyquant vignette

If your use case requires that you “skip over” weekends and even holidays, you might like almanac package.

Time series and outbreak detection

Overview

This tab demonstrates the use of several packages for time series analysis. It primarily relies on packages from the tidyverts family, but will also use the RECON trending package to fit models that are more appropriate for infectious disease epidemiology.

  1. Time series data
  2. Descriptive analysis
  3. Fitting regressions
  4. Relation of two time series
  5. Outbreak detection
  6. Interrupted time series

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               tsibble,      # handle time series datasets
               slider,       # for calculating moving averages
               imputeTS,     # for filling in missing values
               feasts,       # for time series decomposition and autocorrelation
               forecast,     # fit sine and cosine terms to data (note: must load after feasts)
               trending,     # fit and assess models 
               tmaptools,    # for getting geocoordinates (lon/lat) based on place names
               ecmwfr,       # for interacting with the Copernicus satellite CDS API
               stars,        # for reading in .nc (climate data) files
               units,        # for defining units of measurement (climate data)
               yardstick,    # for looking at model accuracy
               surveillance  # for aberration detection
               )

Load data

The example dataset used in this section:

  • Weekly counts of campylobacter cases reported in Germany between 2001 and 2011.

This dataset is a reduced version of the dataset available in the surveillance package. (for details load the surveillance package and see ?campyDE)

The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

# import the linelist
counts <- rio::import("campylobacter_germany.xlsx")

The first 10 rows of the counts are displayed below.

Clean data

The code below makes sure that the date column is in the appropriate format. For this tab we will be using the tsibble package, so the yearweek() function will be used to create a calendar-week variable. There are several other ways of doing this (see the Working with dates page for details), however for time series it’s best to keep within one framework.

## ensure the date column is in the appropriate format
counts$date <- as.Date(counts$date)

## create a calendar week variable 
## fitting the ISO definition of weeks starting on a Monday
counts <- counts %>% 
     mutate(epiweek = yearweek(date, week_start = 1))
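As a quick illustration of what yearweek() produces (assuming the tsibble package is loaded):

```r
## a Monday in the first ISO week of 2011
tsibble::yearweek(as.Date("2011-01-03"), week_start = 1)
## displays as "2011 W01"
```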

Download climate data

In the relation of two time series section of this tab, we will be comparing campylobacter case counts to climate data.

Climate data for anywhere in the world can be downloaded from the EU’s Copernicus Satellite. These are not exact measurements, but based on a model (similar to interpolation), however the benefit is global hourly coverage as well as forecasts.

We will be using the ecmwfr package to pull data from the Copernicus climate data store. You will need to create a free account for this to work. The package website has a useful walkthrough of how to do this. Below is example code of how to go about doing this, once you have the appropriate API keys. Replace the X’s below with your account IDs. You will need to download one year of data at a time, otherwise the server times out.

If you are not sure of the coordinates for a location you want to download data for, you can use the tmaptools package to pull the coordinates from OpenStreetMap. An alternative option is the photon package; however, this has not been released onto CRAN yet. The nice thing about photon is that it provides more contextual data when there are several matches for your search.

## retrieve location coordinates
coords <- geocode_OSM("Germany", geometry = "point")

## pull together long/lats in format for ERA-5 querying (bounding box) 
## (as we only want a single point, we can repeat the coords)
request_coords <- str_glue_data(coords$coords, "{y}/{x}/{y}/{x}")


## Pulling data modelled from copernicus satellite (ERA-5 reanalysis)
## https://cds.climate.copernicus.eu/cdsapp#!/software/app-era5-explorer?tab=app
## https://github.com/bluegreen-labs/ecmwfr

## set up key for weather data 
wf_set_key(user = "XXXXX",
           key = "XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX",
           service = "cds") 

## run for each year of interest (otherwise server times out)
for (i in 2002:2011) {
  
  ## pull together a query 
  ## see here for how to do: https://bluegreen-labs.github.io/ecmwfr/articles/cds_vignette.html#the-request-syntax
  ## change request to a list using addin button above (python to list)
  ## Target is the name of the output file!!
  request <- list(
    product_type = "reanalysis",
    format = "netcdf",
    variable = c("2m_temperature", "total_precipitation"),
    year = c(i),
    month = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"),
    day = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
            "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
            "25", "26", "27", "28", "29", "30", "31"),
    time = c("00:00", "01:00", "02:00", "03:00", "04:00", "05:00", "06:00", "07:00",
             "08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
             "16:00", "17:00", "18:00", "19:00", "20:00", "21:00", "22:00", "23:00"),
    area = request_coords,
    dataset_short_name = "reanalysis-era5-single-levels",
    target = paste0("germany_weather", i, ".nc")
  )
  
  ## download the file and store it in the current working directory
  file <- wf_request(user     = "XXXXX",  # user ID (for authentication)
                     request  = request,  # the request
                     transfer = TRUE,     # download the file
                     path     = here::here("data", "Weather")) ## path to save the data
  }

Load climate data

## define path to weather folder 
file_paths <- list.files(
  here::here("data", "Weather"), 
  full.names = TRUE)

## only keep those with the country name of interest 
file_paths <- file_paths[str_detect(file_paths, "germany")]


## read in as a stars object 
data <- stars::read_stars(file_paths)
## t2m, tp, 
## change to a data frame 
temp_data <- as_tibble(data) %>% 
  ## add in variables and correct units
  mutate(
    ## create a calendar week variable 
    epiweek = tsibble::yearweek(time), 
    ## create a date variable (start of calendar week)
    date = as.Date(epiweek),
    ## change temperature from kelvin to celsius
    t2m = set_units(t2m, celsius), 
    ## change precipitation from metres to millimetres 
    tp  = set_units(tp, mm)) %>% 
  ## group by week (keep the date too though)
  group_by(epiweek, date) %>% 
  ## get the average per week
  summarise(t2m = as.numeric(mean(t2m)), 
            tp = as.numeric(mean(tp)))

Time series data

There are a number of different packages for structuring and handling time series data. As mentioned, we will focus on the tidyverts family of packages, and so will use the tsibble package to define our time series object. Having a dataset defined as a time series object makes it much easier to structure our analysis.

To do this we use the tsibble() function and specify the “index”, i.e. the variable specifying the time unit of interest. In our case this is the epiweek variable.

If we had a data set with weekly counts by province, for example, we would also be able to specify the grouping variable using the key = argument. This would allow us to do analysis for each group.
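As a sketch of the grouped case (the counts_province dataset and its province column are hypothetical - they are not part of this example), that would look like:

```r
## define a grouped time series object
## nb. counts_province and its province column are hypothetical examples
counts_province <- tsibble(counts_province, 
                           index = epiweek, 
                           key   = province)
```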

## define time series object 
counts <- tsibble(counts, index = epiweek)

Looking at class(counts) tells you that on top of being a tidy data frame (“tbl_df”, “tbl”, “data.frame”), it has the additional properties of a time series data frame (“tbl_ts”).
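You can check this directly:

```r
## inspect the classes attached to the time series data frame
class(counts)
```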

You can take a quick look at your data using ggplot2. We see from the plot that there is a clear seasonal pattern, and that there are no missing values. However, there seems to be an issue with reporting at the beginning of each year: cases drop in the last week of the year and then increase in the first week of the next year.

## plot a line graph of cases by week
ggplot(counts, aes(x = epiweek, y = case)) + 
     geom_line()

DANGER: Most datasets aren’t as clean as this example. You will need to check for duplicates and missing values as shown below.

Duplicates

tsibble does not allow duplicate observations. So each row will need to be unique, or unique within the group (key variable). The package has a few functions that help to identify duplicates. These include are_duplicated() which gives you a TRUE/FALSE vector of whether the row is a duplicate, and duplicates() which gives you a data frame of the duplicated rows.

See the page on De-duplication for more details on how to select rows you want.

## get a vector of TRUE/FALSE whether rows are duplicates
are_duplicated(counts, index = epiweek) 

## get a data frame of any duplicated rows 
duplicates(counts, index = epiweek) 

Missings

We saw from our brief inspection above that there are no missing values, but we also saw that there seems to be a problem with reporting delays around the new year. One way to address this problem is to set these values to missing and then impute them. The simplest form of time series imputation is to draw a straight line between the last non-missing and the next non-missing value. To do this we will use the na_interpolation() function from the imputeTS package.

See the Missing data page for other options for imputation.

Another alternative would be to calculate a moving average, to try to smooth over these apparent reporting issues (see the next section, and the page on Moving averages).

## create a variable with missings instead of weeks with reporting issues
counts <- counts %>% 
     mutate(case_miss = if_else(
          ## if epiweek contains weeks 51, 52, 53, 1 or 2
          str_detect(epiweek, "W51|W52|W53|W01|W02"), 
          ## then set to missing 
          NA_real_, 
          ## otherwise keep the value in case
          case
     ))

## alternatively interpolate missings by linear trend 
## between two nearest adjacent points
counts <- counts %>% 
  mutate(case_int = na_interpolation(case_miss)
         )

## to check what values have been imputed compared to the original
ggplot_na_imputations(counts$case_miss, counts$case_int) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Descriptive analysis

Moving averages

If data are very noisy (counts jumping up and down), it can be helpful to calculate a moving average. In the example below, for each week we calculate the average number of cases over that week and the four previous weeks. This smooths the data to make it more interpretable. In our case this does not really add much, so we will stick with the interpolated data for further analysis. See the Moving averages page for more detail.

## create a moving average variable (deals with missings)
counts <- counts %>% 
     ## create the ma_4w variable 
     ## slide over each row of the case variable
     mutate(ma_4wk = slider::slide_dbl(case, 
                               ## for each row calculate the mean
                               ~ mean(.x, na.rm = TRUE),
                               ## use the four previous weeks
                               .before = 4))

## make a quick visualisation of the difference 
ggplot(counts, aes(x = epiweek)) + 
     geom_line(aes(y = case)) + 
     geom_line(aes(y = ma_4wk), colour = "red")

Periodicity

Periodicity refers to how often a pattern repeats in a time series. A spectral periodogram can be used to identify the dominant period (e.g. yearly seasonality); below we define a helper function to compute one and extract the peak weeks.

## x is a dataset
## counts is variable with count data or rates within x 
## start_week is the first week in your dataset
## period is how many units in a year 
## output is whether to return the spectral periodogram or the peak weeks
  ## "periodogram" or "weeks"
periodogram <- function(x, 
                        counts, 
                        start_week = c(2002, 1), 
                        period = 52, 
                        output = "weeks") {
  

    ## make sure the input is not a tsibble and only keep the column of interest
    prepare_data <- dplyr::as_tibble(x)
    prepare_data <- dplyr::select(prepare_data, {{counts}})
    
    ## create an intermediate "zoo" time series to be able to use with spec.pgram
    zoo_cases <- zoo::zooreg(prepare_data, 
                             start = start_week, frequency = period)
    
    ## get a spectral periodogram not using fast fourier transform 
    periodo <- spec.pgram(zoo_cases, fast = FALSE, plot = FALSE)
    
    ## return the peak weeks 
    periodo_weeks <- 1 / periodo$freq[order(-periodo$spec)] * period
    
    if (output == "weeks") {
      periodo_weeks
    } else {
      periodo
    }
    
}

## get spectral periodogram for extracting weeks with the highest frequencies 
## (checking for seasonality) 
periodo <- periodogram(counts, 
                       case_int, 
                       start_week = c(2002, 1),
                       output = "periodogram")

## pull spectrum and frequency into a dataframe for plotting
periodo <- data.frame(periodo$freq, periodo$spec)

## plot a periodogram showing the most frequently occurring periodicity 
ggplot(data = periodo, 
                aes(x = 1/(periodo.freq/52),  y = log(periodo.spec))) + 
  geom_line() + 
  labs(x = "Period (Weeks)", y = "Log(density)")

## get a vector of the most influential periods (in weeks), strongest first 
peak_weeks <- periodogram(counts, 
                          case_int, 
                          start_week = c(2002, 1), 
                          output = "weeks")

NOTE: It is possible to use the above weeks in sine and cosine terms; however, we will use a function to generate these terms instead (see the regression section below).

Decomposition

Classical decomposition is used to break a time series down into several parts, which taken together make up the pattern you see. These different parts are:

  • The trend-cycle (the long-term direction of the data)
  • The seasonality (repeating patterns)
  • The random component (what is left after removing trend and season)

## decompose the counts dataset 
counts %>% 
  # using an additive classical decomposition model
  model(classical_decomposition(case_int, type = "additive")) %>% 
  ## extract the important information from the model
  components() %>% 
  ## generate a plot 
  autoplot()

Autocorrelation

Autocorrelation tells you about the relation between the counts of each week and the weeks before it (called lags).

Using the ACF() function, we can produce a plot which shows us a number of lines for the relation at different lags. Where the lag is 0 (x = 0), this line would always be 1 as it shows the relation between an observation and itself (not shown here). The first line shown here (x = 1) shows the relation between each observation and the observation before it (lag of 1), the second shows the relation between each observation and the observation before last (lag of 2) and so on until lag of 52 which shows the relation between each observation and the observation from 1 year (52 weeks before).

Using the PACF() function (for partial autocorrelation) shows the same type of relation but adjusted for all other weeks between. This is less informative for determining periodicity.

## using the counts dataset
counts %>% 
  ## calculate autocorrelation using a full years worth of lags
  ACF(case_int, lag_max = 52) %>% 
  ## show a plot
  autoplot()

## using the counts data set 
counts %>% 
  ## calculate the partial autocorrelation using a full years worth of lags
  PACF(case_int, lag_max = 52) %>% 
  ## show a plot
  autoplot()

You can formally test the null hypothesis of independence in a time series (i.e. that it is not autocorrelated) using the Ljung-Box test (in the stats package). A significant p-value suggests that there is autocorrelation in the data.

## test for independence 
Box.test(counts$case_int, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  counts$case_int
## X-squared = 472.95448128759, df = 1, p-value < 0.00000000000000022204460493

Fitting regressions

It is possible to fit a large number of different regressions to a time series; however, here we will demonstrate how to fit a negative binomial regression, as this is often the most appropriate for count data in infectious diseases.

Fourier terms

Fourier terms are the equivalent of sine and cosine curves. The difference is that they are fit by finding the most appropriate combination of curves to explain your data.

If only fitting one fourier term, this would be the equivalent of fitting a sine and a cosine curve for the most dominant periodicity seen in your periodogram (in our case 52 weeks). We use the fourier() function from the forecast package.

In the code below we assign using $, as fourier() returns two columns (one for sine, one for cosine). These are added to the dataset as a single column called “fourier”, which can then be used like a normal variable in regression.

## add in fourier terms using the epiweek and case_int variables
counts$fourier <- select(counts, epiweek, case_int) %>% 
  fourier(K = 1)

Negative binomial

It is possible to fit regressions using base stats or MASS functions (e.g. lm(), glm() and glm.nb()). However, we will use those from the trending package, as this allows calculating appropriate confidence and prediction intervals (which are otherwise not available). The syntax is the same: you specify an outcome variable, then a tilde (~), and then add your various exposure variables of interest separated by a plus (+).

The other difference is that we first define the model and then fit() it to the data. This is useful because it allows for comparing multiple different models with the same syntax.

TIP: If you wanted to use rates rather than counts, you could include the population variable as a logarithmic offset term by adding offset(log(population)). You would then need to set population to 1 before using predict() in order to produce a rate.

TIP: For fitting more complex models such as ARIMA or prophet, see the fable package.
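The rate-based approach from the tip above could be sketched as follows (the population column here is hypothetical - the example dataset does not include one):

```r
## hypothetical sketch: modelling rates rather than counts
## (assumes the data include a population column)
rate_model <- glm_nb_model(
  case_int ~
    epiweek +
    fourier +
    ## include population as a logarithmic offset term
    offset(log(population)))

## to produce a rate, set population to 1 before predicting, e.g.:
## predict(trending::fit(rate_model, mutate(counts, population = 1)))
```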

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier)

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model)

## plot your regression 
ggplot(data = observed, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "Red") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Residuals

To see how well our model fits the observed data we need to look at the residuals. The residuals are the difference between the observed counts and the counts estimated from the model. We could calculate this simply by using case_int - estimate, but the residuals() function extracts this directly from the regression for us.

What we see below is that we are not explaining all of the variation that we could with this model. It might be that we should fit more fourier terms and address the amplitude. For this example we will leave it as is. The plots show that our model does worse at the peaks and troughs (when counts are at their highest and lowest), and that it may be more likely to underestimate the observed counts.

## calculate the residuals 
observed <- observed %>% 
  mutate(resid = residuals(fitted_model$fitted_model, type = "response"))

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
observed %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")

## is there autocorrelation in the residuals (is there a pattern to the error?)  
observed %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

## are residuals normally distributed (are we under- or over-estimating?)  
observed %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 

## compare observed counts to their residuals 
  ## should also be no pattern 
observed %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(observed$resid, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  observed$resid
## X-squared = 356.55296553324, df = 1, p-value < 0.00000000000000022204460493

Relation of two time series

Here we look at using weather data (specifically the temperature) to explain campylobacter case counts.

Merging datasets

We can join our datasets using the week variable. For more on merging see the handbook section on [joining].

## left join so that we only have the rows already existing in counts
## drop the date variable from temp_data (otherwise is duplicated)
counts <- left_join(counts, 
                    select(temp_data, -date),
                    by = "epiweek")

Descriptive analysis

First plot your data to see if there is any obvious relation. The plot below shows a clear relation in the seasonality of the two variables, and that temperature might peak a few weeks before case counts do. For more on pivoting data, see the handbook section on [cleaning data].

counts %>% 
  ## keep the variables we are interested 
  select(epiweek, case_int, t2m) %>% 
  ## change your data in to long format
  pivot_longer(
    ## use epiweek as your key
    !epiweek,
    ## move column names to the new "measure" column
    names_to = "measure", 
    ## move cell values to the new "values" column
    values_to = "value") %>% 
  ## create a plot with the dataset above
  ## plot epiweek on the x axis and values (counts/celsius) on the y 
  ggplot(aes(x = epiweek, y = value)) + 
    ## create a separate plot for temperature and case counts 
    ## let them set their own y-axes
    facet_grid(measure ~ ., scales = "free_y") +
    ## plot both as a line
    geom_line()

Lags and cross-correlation

To formally test which lags are most highly correlated between cases and temperature, we can use the cross-correlation function CCF() from the feasts package. You could also visualise the result (rather than using arrange()) with the autoplot() function.
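For example, the visual alternative could look like this:

```r
## plot the cross-correlation coefficients by lag rather than listing them
counts %>% 
  CCF(case_int, t2m, lag_max = 52, type = "correlation") %>% 
  autoplot()
```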

counts %>% 
  ## calculate cross-correlation between interpolated counts and temperature
  CCF(case_int, t2m,
      ## set the maximum lag to be 52 weeks
      lag_max = 52, 
      ## return the correlation coefficient 
      type = "correlation") %>% 
  ## arrange in descending order of the correlation coefficient 
  ## show the most associated lags
  arrange(-ccf) %>% 
  ## only show the top ten 
  slice_head(n = 10)
## Warning: Current temporal ordering may yield unexpected results.
## i Suggest to sort by ``, `lag` first.
## # A tsibble: 10 x 2 [1W]
##      lag   ccf
##    <lag> <dbl>
##  1    4W 0.750
##  2    5W 0.746
##  3    3W 0.736
##  4    6W 0.731
##  5    2W 0.727
##  6    7W 0.705
##  7    1W 0.694
##  8    8W 0.671
##  9    0W 0.647
## 10  -47W 0.640

We see from this that a lag of 4 weeks is most highly correlated, so we make a lagged temperature variable to include in our regression.

counts <- counts %>% 
  ## create a new variable for temperature lagged by four weeks
  mutate(t2m_lag4 = lag(t2m, n = 4))

Negative binomial with two variables

We fit a negative binomial regression as done previously. This time we add the temperature variable lagged by four weeks.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## use the temperature lagged by four weeks 
    t2m_lag4
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model)

To investigate the individual terms, we can pull the original negative binomial regression out of the trending format using get_model() and pass this to the broom package tidy() function to retrieve exponentiated estimates and associated confidence intervals.

What this shows us is that, after controlling for trend and seasonality, lagged temperature has an incidence rate ratio close to 1 but is significantly associated with case counts. This suggests that it might be a good variable for use in predicting future case numbers (as climate forecasts are readily available).

fitted_model %>% 
  ## extract original negative binomial regression
  get_model() %>% 
  ## get a tidy dataframe of results
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)
## Warning: Tidiers for objects of class negbin are not maintained by the broom team, and are only supported through the glm tidier
## method. Please be cautious in interpreting and reporting broom output.
## # A tibble: 5 x 7
##   term         estimate  std.error statistic  p.value conf.low conf.high
##   <chr>           <dbl>      <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   369.    0.105          56.4  0.        300.      453.   
## 2 epiweek         1.00  0.00000747     10.5  9.57e-26    1.00      1.00 
## 3 fourierS1-52    0.753 0.0213        -13.3  1.98e-40    0.723     0.785
## 4 fourierC1-52    0.816 0.0198        -10.3  9.30e-25    0.786     0.848
## 5 t2m_lag4        1.01  0.00267         2.36 1.81e- 2    1.00      1.01

A quick visual inspection of the model shows that it might do a better job of estimating the observed case counts.

## plot your regression 
ggplot(data = observed, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "Red") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Residuals

We investigate the residuals again to see how well our model fits the observed data. The results and interpretation here are similar to those of the previous regression, so it may be more feasible to stick with the simpler model without temperature.

## calculate the residuals 
observed <- observed %>% 
  mutate(resid = case_int - estimate)

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
observed %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).

## is there autocorrelation in the residuals (is there a pattern to the error?)  
observed %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

## are residuals normally distributed (are we under- or over-estimating?)  
observed %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 
## Warning: Removed 4 rows containing non-finite values (stat_bin).

## compare observed counts to their residuals 
  ## should also be no pattern 
observed %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")
## Warning: Removed 4 rows containing missing values (geom_point).

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(observed$resid, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  observed$resid
## X-squared = 349.56559102769, df = 1, p-value < 0.00000000000000022204460493

Outbreak detection

We will demonstrate two (similar) methods of detecting outbreaks here. The first builds on the sections above: we use the trending package to fit regressions to previous years and then predict what we expect to see in the following year. If observed counts are above what we expect, this could suggest an outbreak. The second method is based on similar principles but uses the surveillance package, which has a number of different algorithms for aberration detection.

CAUTION: Normally, you are interested in the current year (where you only know counts up to the present week). So in this example we are pretending to be in week 52 of 2011.

surveillance package

In this section we use the surveillance package to create alert thresholds based on outbreak detection algorithms. There are several different methods available in the package; however, we will focus on two options here. For details, see these papers on the application and theory of the algorithms used.

The first option uses the improved Farrington method. This fits a negative binomial glm (including trend) and down-weights past outbreaks (outliers) to create a threshold level.

The second option uses the glrnb method. This also fits a negative binomial glm, but includes trend and fourier terms (so is favoured here). The regression is used to calculate the “control mean” (~fitted values); a computed generalized likelihood ratio statistic is then used to assess whether there is a shift in the mean for each week. Note that the threshold for each week takes into account previous weeks, so a sustained shift will trigger an alarm. (Also note that after each alarm the algorithm is reset.)

In order to work with the surveillance package, we first need to define a “surveillance time series” object (using the sts() function) to fit within the framework.
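Note that the code below also uses start_date and cut_off objects that are not defined in this section; a minimal sketch of how they could be defined (treating 2011 as the prediction year, per the caution above) is:

```r
## sketch: define the start of the data and the cut-off before the 
## prediction period (so that 2011 onwards is treated as the year of interest)
start_date <- tsibble::yearweek(min(counts$date))
cut_off    <- tsibble::yearweek("2010-12-31")
```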

## define surveillance time series object
## nb. you can include a denominator with the population object (see ?sts)
counts_sts <- sts(observed = counts$case_int,
                  start = c(
                    ## subset to only keep the year from start_date 
                    as.numeric(str_sub(start_date, 1, 4)), 
                    ## subset to only keep the week from start_date
                    as.numeric(str_sub(start_date, 7, 8))), 
                  ## define the type of data (in this case weekly)
                  freq = 52)

## define the week range that you want to include (ie. prediction period)
## nb. the sts object only counts observations without assigning a week or 
## year identifier to them - so we use our data to define the appropriate observations
weekrange <- cut_off - start_date

Farrington method

We then define each of our parameters for the Farrington method in a list. We run the algorithm using farringtonFlexible() and can then extract the threshold for an alert using farringtonmethod@upperbound to include it in our dataset. It is also possible to extract a TRUE/FALSE for each week indicating whether it triggered an alert (was above the threshold) using farringtonmethod@alarm.

## define control
ctrl <- list(
  ## define the time period you want a threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  b = 9, ## how many years backwards for baseline
  w = 2, ## rolling window size in weeks
  weightsThreshold = 2.58, ## reweighting past outbreaks (improved noufaily method - original suggests 1)
  ## pastWeeksNotIncluded = 3, ## use all weeks available (noufaily suggests drop 26)
  trend = TRUE,
  pThresholdTrend = 1, ## 0.05 normally, however 1 is advised in the improved method (i.e. always keep)
  thresholdMethod = "nbPlugin",
  populationOffset = TRUE
  )

## apply farrington flexible method
farringtonmethod <- farringtonFlexible(counts_sts, ctrl)

## create a new variable in the original dataset called threshold
## containing the upper bound from farrington 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off),
              "threshold"] <- farringtonmethod@upperbound

We can then visualise the results in ggplot as done previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

GLRNB method

Similarly, for the GLRNB method we define each of our parameters in a list, then fit the algorithm and extract the upper bounds.

CAUTION: This method uses “brute force” (similar to bootstrapping) for calculating thresholds, so can take a long time!

See the GLRNB vignette for details.

## define control options
ctrl <- list(
  ## define the time period you want a threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  mu0 = list(S = 1,    ## number of fourier terms (harmonics) to include
  trend = TRUE,   ## whether to include trend or not
  refit = FALSE), ## whether to refit model after each alarm
  ## cARL = threshold for GLR statistic (arbitrary)
     ## 3 ~ middle ground for minimising false positives
     ## 1 fits to the 99%PI of glm.nb - with changes after peaks (threshold lowered for alert)
   c.ARL = 2,
   # theta = log(1.5), ## equates to a 50% increase in cases in an outbreak
   ret = "cases"     ## return threshold upperbound as case counts
  )

## apply the glrnb method
glrnbmethod <- glrnb(counts_sts, control = ctrl, verbose = FALSE)

## create a new variable in the original dataset called threshold
## containing the upper bound from glrnb 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off),
              "threshold_glrnb"] <- glrnbmethod@upperbound

Visualise the outputs as previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold_glrnb, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

Interrupted timeseries

Interrupted timeseries (also called segmented regression or intervention analysis) is often used to assess the impact of vaccines on the incidence of disease, but it can be used to assess the impact of a wide range of interventions or introductions, for example changes in hospital procedures or the introduction of a new disease strain to a population. In this example we will pretend that a new strain of Campylobacter was introduced to Germany at the end of 2008, and see whether that affects the number of cases. We will use negative binomial regression again. This time the regression will be split into two parts, one before the intervention (here, the introduction of the new strain) and one after (the pre- and post-periods). This allows us to calculate an incidence rate ratio comparing the two time periods. Explaining the equation might make this clearer (if not, then just ignore it!).

The negative binomial regression can be defined as follows:

\[\log(Y_t)= β_0 + β_1 \times t+ β_2 \times δ(t-t_0) + β_3\times(t-t_0 )^+ + log(pop_t) + e_t\]

Where:

  • \(Y_t\) is the number of cases observed at time \(t\)
  • \(pop_t\) is the population size in 100,000s at time \(t\) (not used here)
  • \(t_0\) is the last year of the pre-period (including transition time, if any)
  • \(δ(x)\) is the indicator function (it is 0 if x≤0 and 1 if x>0)
  • \((x)^+\) is the cut-off operator (it is x if x>0 and 0 otherwise)
  • \(e_t\) denotes the residual

Additional terms for trend and season can be added as needed.

\(β_2 \times δ(t-t_0) + β_3\times(t-t_0 )^+\) is the generalised linear part of the post-period and is zero in the pre-period. This means that the \(β_2\) and \(β_3\) estimates are the effects of the intervention.
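
To make the link between coefficients and percentage change concrete: because the model is on the log scale, exponentiating \(β_2\) gives an incidence rate ratio (IRR), and 100 × (IRR − 1) is the percentage change in cases. A minimal sketch with an illustrative (hypothetical) coefficient:

```r
## hypothetical step-change coefficient beta2 from a fitted model
beta2 <- -0.043

## exponentiate to get the incidence rate ratio
irr <- exp(beta2)

## convert the IRR to a percentage change in cases
perc_change <- 100 * (irr - 1)

round(perc_change, 1)
## [1] -4.2
```

This is exactly the transformation applied to the regression output further below.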

We need to re-calculate the fourier terms without forecasting here, as we will use all the data available to us (i.e. retrospectively). Additionally we need to calculate the extra terms needed for the regression.

## add in fourier terms using the epiweek and case_int variables
counts$fourier <- select(counts, epiweek, case_int) %>% 
  as_tsibble(index = epiweek) %>% 
  fourier(K = 1)

## define intervention week 
intervention_week <- yearweek("2008-12-31")

## define variables for regression 
counts <- counts %>% 
  mutate(
    ## corresponds to t in the formula
      ## count of weeks (could probably also just use straight epiweeks var)
    # linear = row_number(epiweek), 
    ## corresponds to delta(t-t0) in the formula
      ## pre or post intervention period
    intervention = as.numeric(epiweek >= intervention_week), 
    ## corresponds to (t-t0)^+ in the formula
      ## count of weeks post intervention
      ## (choose the larger number between 0 and whatever comes from calculation)
    time_post = pmax(0, epiweek - intervention_week + 1))

We then use these terms to fit a negative binomial regression, and produce a table of the percentage change. This example shows that there was no significant change.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## add in whether in the pre- or post-period 
    intervention + 
    ## add in the time post intervention 
    time_post
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model)



## show estimates and percentage change in a table
fitted_model %>% 
  ## extract original negative binomial regression
  get_model() %>% 
  ## get a tidy dataframe of results
  tidy(exponentiate = TRUE, 
       conf.int = TRUE) %>% 
  ## only keep the intervention value 
  filter(term == "intervention") %>% 
  ## change the IRR to percentage change for estimate and CIs 
  mutate(
    ## for each of the columns of interest - create a new column
    across(
      all_of(c("estimate", "conf.low", "conf.high")), 
      ## apply the formula to calculate percentage change
            .f = function(i) 100 * (i - 1), 
      ## add a suffix to new column names with "_perc"
      .names = "{.col}_perc")
    ) %>% 
  ## only keep (and rename) certain columns 
  select("IRR" = estimate, 
         "95%CI low" = conf.low, 
         "95%CI high" = conf.high,
         "Percentage change" = estimate_perc, 
         "95%CI low (perc)" = conf.low_perc, 
         "95%CI high (perc)" = conf.high_perc,
         "p-value" = p.value)
## # A tibble: 1 x 7
##     IRR `95%CI low` `95%CI high` `Percentage change` `95%CI low (perc)` `95%CI high (perc)` `p-value`
##   <dbl>       <dbl>        <dbl>               <dbl>              <dbl>               <dbl>     <dbl>
## 1 0.958       0.896         1.03               -4.17              -10.4                2.53     0.220

As previously we can visualise the outputs of the regression.

ggplot(observed, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate, col = "Estimate")) + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add vertical line and label to show when the intervention occurred
  geom_vline(
           xintercept = as.Date(intervention_week), 
           linetype = "dashed") + 
  annotate(geom = "text", 
           label = "Intervention", 
           x = intervention_week, 
           y = max(observed$upper_pi), 
           angle = 90, 
           vjust = 1
           ) + 
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Estimate" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Epidemic modeling

Overview

There is a growing body of tools for epidemic modelling that let us conduct fairly complex analyses with minimal effort. This section will provide an overview on how to use these tools to:

  • estimate the effective reproduction number Rt and related statistics such as the doubling time
  • produce short-term projections of future incidence

It is not intended as an overview of the methodologies and statistical methods underlying these tools, so please refer to the Resources tab for links to some papers covering this. Make sure you have an understanding of the methods before using these tools; this will ensure you can accurately interpret their results.

Below is an example of one of the outputs we’ll be producing in this section.

Preparation

We will use two different methods and packages for Rt estimation, namely EpiNow2 and EpiEstim, as well as the projections package for forecasting case incidence.

pacman::p_load(
   rio,          # File import
   here,         # File locator
   tidyverse,    # Data management + ggplot2 graphics
   epicontacts,  # Analysing transmission networks
   EpiNow2,      # Rt estimation
   EpiEstim,     # Rt estimation
   projections,  # Incidence projections
   incidence,    # Handling incidence data
   epitrix,      # Useful epi functions
   distcrete     # Discrete delay distributions
)

We will use the standard, cleaned linelist for all analyses in this section.

# import the cleaned linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Estimating Rt

EpiNow2 vs. EpiEstim

The reproduction number R is a measure of the transmissibility of a disease and is defined as the expected number of secondary cases per infected case. In a fully susceptible population, this value represents the basic reproduction number R0. However, as the number of susceptible individuals in a population changes over the course of an outbreak or pandemic, and as various response measures are implemented, the most commonly used measure of transmissibility is the effective reproduction number Rt; this is defined as the expected number of secondary cases per infected case at a given time t.
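
One way to build intuition for the relationship between R0 and Rt: ignoring interventions and assuming homogeneous mixing, Rt is roughly R0 scaled by the fraction of the population still susceptible. A sketch with illustrative (hypothetical) numbers, not estimates from our data:

```r
## hypothetical basic reproduction number
r0 <- 2.5

## hypothetical fraction of the population still susceptible (60%)
prop_susceptible <- 0.6

## effective reproduction number under these simplifying assumptions
rt <- r0 * prop_susceptible
rt
## [1] 1.5
```

The packages below estimate Rt directly from case data rather than from this simplification.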

The EpiNow2 package provides the most sophisticated framework for estimating Rt. It has two key advantages over the other commonly used package, EpiEstim:

  • It accounts for delays in reporting and can therefore estimate Rt even when recent data is incomplete.
  • It estimates Rt on dates of infection rather than dates of onset or reporting, which means that the effect of an intervention will be reflected immediately in a change in Rt, rather than with a delay.

However, it also has two key disadvantages:

  • It requires knowledge of the generation time distribution (i.e. the distribution of delays between the infection of a primary and a secondary case), the incubation period distribution (i.e. the distribution of delays between infection and symptom onset) and any further delay distributions relevant to your data (e.g. if you have dates of reporting, you require the distribution of delays from symptom onset to reporting). While these allow more accurate estimation of Rt, EpiEstim only requires the serial interval distribution (i.e. the distribution of delays between symptom onset of a primary and a secondary case), which may be the only distribution available to you.
  • EpiNow2 is significantly slower than EpiEstim, anecdotally by a factor of about 100-1000! For example, estimating Rt for the sample outbreak considered in this section takes about four hours (this was run for a large number of iterations to ensure high accuracy, and could probably be reduced if necessary; however, the point stands that the algorithm is slow in general). This may be infeasible if you are regularly updating your Rt estimates.

Which package you choose to use will therefore depend on the data, time and computational resources available to you.

EpiNow2

Estimating delay distributions

The delay distributions required to run EpiNow2 depend on the data you have. Essentially, you need to be able to describe the delay from the date of infection to the date of the event you want to use to estimate Rt. If you are using dates of onset, this would simply be the incubation period distribution. If you are using dates of reporting, you require the delay from infection to reporting. As this distribution is unlikely to be known directly, EpiNow2 lets you chain multiple delay distributions together; in this case, the delay from infection to symptom onset (e.g. the incubation period, which is likely known) and from symptom onset to reporting (which you can often estimate from the data).

As we have the dates of onset for all our cases in the example linelist, we will only require the incubation period distribution to link our data (e.g. dates of symptom onset) to the date of infection. We can either estimate this distribution from the data or use values from the literature.

A literature estimate of the incubation period of Ebola (taken from this paper) with a mean of 9.1, standard deviation of 7.3 and maximum value of 30 would be specified as follows:

incubation_period_lit <- list(
  mean = log(9.1),
  mean_sd = log(0.1),
  sd = log(7.3),
  sd_sd = log(0.1),
  max = 30
)

Note that EpiNow2 requires these delay distributions to be provided on a log scale, hence the log call around each value (except the max parameter which, confusingly, has to be provided on a natural scale). The mean_sd and sd_sd define the standard deviation of the mean and standard deviation estimates. As these are not known in this case, we choose the fairly arbitrary value of 0.1.

In this analysis, we instead estimate the incubation period distribution from the linelist itself using the function bootstrapped_dist_fit, which will fit a lognormal distribution to the observed delays between infection and onset in the linelist.

## estimate incubation period
incubation_period <- bootstrapped_dist_fit(
  linelist$date_onset - linelist$date_infection,
  dist = "lognormal",
  max_value = 100,
  bootstraps = 1
)

The other distribution we require is the generation time. As we have data on infection times and transmission links, we can estimate this distribution from the linelist by calculating the delay between infection times of infector-infectee pairs. To do this, we use the handy get_pairwise function from the package epicontacts, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains chapter for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in infection times between transmission pairs, calculated using get_pairwise, to a gamma distribution:

## estimate gamma generation time
generation_time <- bootstrapped_dist_fit(
  get_pairwise(epic, "date_infection"),
  dist = "gamma",
  max_value = 20,
  bootstraps = 1
)

Running EpiNow2

Now we just need to calculate daily incidence from the linelist, which we can do easily with the dplyr functions group_by() and n(). Note that EpiNow2 requires the column names to be date and confirm.

## get incidence from onset dates
cases <- linelist %>%
  group_by(date = date_onset) %>%
  summarise(confirm = n())

We can then estimate Rt using the epinow function. Some notes on the inputs:

  • We can provide any number of ‘chained’ delay distributions to the delays argument; we would simply insert them alongside the incubation_period object within the delay_opts function.
  • return_output ensures the output is returned within R and not just saved to a file.
  • verbose specifies that we want a readout of the progress.
  • horizon indicates how many days we want to project future incidence for.
  • We pass additional options to the stan argument to specify how long we want to run the inference for. Increasing samples and chains will give you a more accurate estimate that better characterises uncertainty, but will take longer to run.

## run epinow
epinow_res <- epinow(
  reported_cases = cases,
  generation_time = generation_time,
  delays = delay_opts(incubation_period),
  return_output = TRUE,
  verbose = TRUE,
  horizon = 21,
  stan = stan_opts(samples = 750, chains = 4)
)

Analysing outputs

Once the code has finished running, we can plot a summary very easily as follows:

## plot summary figure
plot(epinow_res)

We can also look at various summary statistics:

## summary table
epinow_res$summary
##                                  measure                  estimate   numeric_estimate
## 1: New confirmed cases by infection date                4 (2 -- 6)  <data.table[1x9]>
## 2:        Expected change in daily cases                    Unsure 0.5600000000000001
## 3:            Effective reproduction no.        0.88 (0.73 -- 1.1)  <data.table[1x9]>
## 4:                        Rate of growth -0.012 (-0.028 -- 0.0052)  <data.table[1x9]>
## 5:          Doubling/halving time (days)          -60 (130 -- -25)  <data.table[1x9]>

For further analyses and custom plotting, you can access the summarised daily estimates via $estimates$summarised. We will convert this from the default data.table to a tibble for ease of use with dplyr.

## extract summary and convert to tibble
estimates <- as_tibble(epinow_res$estimates$summarised)
estimates

As an example, let’s make a plot of the doubling time and Rt. We will only look at the first few months of the outbreak, when Rt is well above one, to avoid plotting extremely high doubling times.

We use the formula log(2)/growth_rate to calculate the doubling time from the estimated growth rate.
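
As a quick worked example of this conversion (with illustrative growth rates, not values from the data):

```r
## hypothetical daily growth rates
growth_rate <- c(0.01, 0.05, 0.1)

## doubling time in days: the time for cases to double under exponential growth
doubling_time <- log(2) / growth_rate

round(doubling_time, 1)
## [1] 69.3 13.9  6.9
```

Note that negative growth rates give negative values, which are interpreted as halving times.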

## make wide df for median plotting
df_wide <- estimates %>%
  filter(
    variable %in% c("growth_rate", "R"),
    date < as.Date("2014-09-01")
  ) %>%
  ## convert growth rates to doubling times
  mutate(
    across(
      c(median, lower_90:upper_90),
      ~ case_when(
        variable == "growth_rate" ~ log(2)/.x,
        TRUE ~ .x
      )
    ),
    ## rename variable to reflect transformation
    variable = replace(variable, variable == "growth_rate", "doubling_time")
  )

## make long df for quantile plotting
df_long <- df_wide %>%
  ## here we match corresponding quantiles (e.g. lower_90 to upper_90)
  pivot_longer(
    lower_90:upper_90,
    names_to = c(".value", "quantile"),
    names_pattern = "(.+)_(.+)"
  )

## make plot
ggplot() +
  geom_ribbon(
    data = df_long,
    aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = date, y = median)
  ) +
  ## use label_parsed to allow subscript label
  facet_wrap(
    ~ variable,
    ncol = 1,
    scales = "free_y",
    labeller = as_labeller(c(R = "R[t]", doubling_time = "Doubling~time"), label_parsed),
    strip.position = 'left'
  ) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = NULL,
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.background = element_blank(),
    strip.placement = 'outside'
  )

EpiEstim

To run EpiEstim, we need to provide data on daily incidence and specify the serial interval (i.e. the distribution of delays between symptom onset of primary and secondary cases).

Incidence data can be provided as a vector, a dataframe or an incidence object from the incidence package, and you can even distinguish between imports and locally acquired infections; see the documentation at ?estimate_R for further details. We will create an incidence object:

## get incidence from onset date
cases <- incidence(linelist$date_onset)

The package provides several options for specifying the serial interval, the details of which are provided in the documentation at ?estimate_R. We will cover two of them here.

Using serial interval estimates from the literature

Using the option method = "parametric_si", we can manually specify the mean and standard deviation of the serial interval in a config object created using the function make_config. We use a mean and standard deviation of 12.0 and 5.2, respectively, defined in this paper:

## make config
config_lit <- make_config(
  mean_si = 12.0,
  std_si = 5.2
)

We can then estimate Rt with the estimate_R function:

epiestim_res_lit <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_lit
)
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.

and plot a summary of the outputs:

plot(epiestim_res_lit)

Using serial interval estimates from the data

As we have data on dates of symptom onset and transmission links, we can also estimate the serial interval from the linelist by calculating the delay between onset dates of infector-infectee pairs. As we did in the EpiNow2 section, we will use the get_pairwise function from the epicontacts package, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains chapter for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in onset dates between transmission pairs, calculated using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma from the epitrix package for this fitting procedure, as we require a discretised distribution.

## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))

We then pass this information to the config object, run EpiEstim again and plot the results:

## make config
config_emp <- make_config(
  mean_si = serial_interval$mu,
  std_si = serial_interval$sd
)

## run epiestim
epiestim_res_emp <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_emp
)
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.
## plot outputs
plot(epiestim_res_emp)

Specifying estimation time windows

These default options will provide a weekly sliding estimate, and might give a warning that you are estimating Rt too early in the outbreak for a precise estimate. You can change this by setting a later start date for the estimation, as shown below. Unfortunately, EpiEstim only provides a rather clunky way of specifying these estimation windows, in that you have to provide a vector of integers referring to the start and end dates of each time window.

## define a vector of dates starting on June 1st
start_dates <- seq.Date(
  as.Date("2014-06-01"),
  max(cases$dates) - 7,
  by = 1
) %>%
  ## subtract the starting date to convert to numeric
  `-`(min(cases$dates)) %>%
  ## convert to integer
  as.integer()

## add six days for a one week sliding window
end_dates <- start_dates + 6
  
## make config
config_partial <- make_config(
  mean_si = 12.0,
  std_si = 5.2,
  t_start = start_dates,
  t_end = end_dates
)

Now we re-run EpiEstim and can see that the estimates only start from June:

## run epiestim
epiestim_res_partial <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_partial
)

## plot outputs
plot(epiestim_res_partial)

Analysing outputs

The main outputs can be accessed via $R. As an example, we will create a plot of Rt and a measure of “transmission potential”, given by the product of Rt and the number of cases reported on that day; this represents the expected number of cases in the next generation of infection.
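
The calculation behind “transmission potential” is just that product; with an Rt estimate and a day's case count, the expected size of the next generation follows directly. A sketch with hypothetical values:

```r
## hypothetical Rt estimate for a given day
rt <- 1.2

## hypothetical number of cases reported on that day
cases_today <- 50

## expected number of cases in the next generation of infection
transmission_potential <- rt * cases_today
transmission_potential
## [1] 60
```

The code below applies the same product across the quantiles of the Rt estimates.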

## make wide dataframe for median
df_wide <- epiestim_res_lit$R %>%
  rename_all(clean_labels) %>%
  rename(
    lower_95_r = quantile_0_025_r,
    lower_90_r = quantile_0_05_r,
    lower_50_r = quantile_0_25_r,
    upper_50_r = quantile_0_75_r,
    upper_90_r = quantile_0_95_r,
    upper_95_r = quantile_0_975_r,
    ) %>%
  mutate(
    ## extract the median date from t_start and t_end
    dates = epiestim_res_lit$dates[round(map2_dbl(t_start, t_end, median))],
    var = "R[t]"
  ) %>%
  ## merge in daily incidence data
  left_join(as.data.frame(cases), "dates") %>%
  ## calculate risk across all r estimates
  mutate(
    across(
      lower_95_r:upper_95_r,
      ~ .x*counts,
      .names = "{str_replace(.col, '_r', '_risk')}"
    )
  ) %>%
  ## separate r estimates and risk estimates
  pivot_longer(
    contains("median"),
    names_to = c(".value", "variable"),
    names_pattern = "(.+)_(.+)"
  ) %>%
  ## assign factor levels
  mutate(variable = factor(variable, c("risk", "r")))

## make long dataframe from quantiles
df_long <- df_wide %>%
  select(-variable, -median) %>%
  ## separate r/risk estimates and quantile levels
  pivot_longer(
    contains(c("lower", "upper")),
    names_to = c(".value", "quantile", "variable"),
    names_pattern = "(.+)_(.+)_(.+)"
  ) %>%
  mutate(variable = factor(variable, c("risk", "r")))

## make plot
ggplot() +
  geom_ribbon(
    data = df_long,
    aes(x = dates, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = dates, y = median),
    alpha = 0.2
  ) +
  ## use label_parsed to allow subscript label
  facet_wrap(
    ~ variable,
    ncol = 1,
    scales = "free_y",
    labeller = as_labeller(c(r = "R[t]", risk = "Transmission~potential"), label_parsed),
    strip.position = 'left'
  ) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`50` = 0.7, `90` = 0.4, `95` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = NULL,
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.background = element_blank(),
    strip.placement = 'outside'
  )

Projecting incidence

EpiNow2

Besides estimating Rt, EpiNow2 also supports forecasting of Rt and projecting case numbers, via integration with the EpiSoon package under the hood. All you need to do is specify the horizon argument in your epinow function call, indicating how many days you want to project into the future; see the EpiNow2 section under “Estimating Rt” for details on how to get EpiNow2 up and running. In this section, we will just plot the outputs from that analysis, stored in the epinow_res object.

## define minimum date for plot
min_date <- as.Date("2015-03-01")

## extract summarised estimates
estimates <-  as_tibble(epinow_res$estimates$summarised)

## extract raw data on case incidence
observations <- as_tibble(epinow_res$estimates$observations) %>%
  filter(date > min_date)

## extract forecasted estimates of case numbers
df_wide <- estimates %>%
  filter(
    variable == "reported_cases",
    type == "forecast",
    date > min_date
  )

## convert to even longer format for quantile plotting
df_long <- df_wide %>%
  ## here we match corresponding quantiles (e.g. lower_90 to upper_90)
  pivot_longer(
    lower_90:upper_90,
    names_to = c(".value", "quantile"),
    names_pattern = "(.+)_(.+)"
  )

## make plot
ggplot() +
  geom_histogram(
    data = observations,
    aes(x = date, y = confirm),
    stat = 'identity',
    binwidth = 1
  ) +
  geom_ribbon(
    data = df_long,
    aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = date, y = median)
  ) +
  geom_vline(xintercept = min(df_long$date), linetype = 2) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = "Daily reported cases",
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14)

projections

The projections package developed by RECON makes it very easy to make short-term incidence forecasts, requiring only knowledge of the effective reproduction number Rt and the serial interval. Here we will cover how to use serial interval estimates from the literature as well as our own estimates from the linelist.

Using serial interval estimates from the literature

projections requires a discretised serial interval distribution of the class distcrete from the package distcrete. We will use a gamma distribution with a mean of 12.0 and a standard deviation of 5.2, defined in this paper. To convert these values into the shape and scale parameters required for a gamma distribution, we will use the function gamma_mucv2shapescale from the epitrix package.

## get shape and scale parameters from the mean mu and the coefficient of
## variation (e.g. the ratio of the standard deviation to the mean)
shapescale <- epitrix::gamma_mucv2shapescale(mu = 12.0, cv = 5.2/12)

## make distcrete object
serial_interval_lit <- distcrete::distcrete(
  name = "gamma",
  interval = 1,
  shape = shapescale$shape,
  scale = shapescale$scale
)

Here is a quick check to make sure the serial interval looks correct. We access the density of the gamma distribution we have just defined via $d, which is equivalent to calling dgamma:

## check to make sure the serial interval looks correct
qplot(
  x = 0:50, y = serial_interval_lit$d(0:50), geom = "area",
  xlab = "Serial interval", ylab = "Density"
)
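
The mean-CV conversion can also be checked by hand: for a gamma distribution, shape = 1/cv² and scale = mean × cv², so shape × scale recovers the mean and √shape × scale recovers the standard deviation. A sketch of that check in base R:

```r
## mean and coefficient of variation of the serial interval
mu <- 12.0
cv <- 5.2 / 12   ## sd divided by mean

## gamma parameterisation in terms of mean and cv
shape <- 1 / cv^2   ## approx. 5.33
scale <- mu * cv^2  ## approx. 2.25

## shape * scale should recover the mean (12.0);
## sqrt(shape) * scale should recover the sd (5.2)
c(mean_check = shape * scale, sd_check = sqrt(shape) * scale)
```

This mirrors what gamma_mucv2shapescale does internally, up to numerical precision.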

Using serial interval estimates from the data

As we have data on dates of symptom onset and transmission links, we can also estimate the serial interval from the linelist by calculating the delay between onset dates of infector-infectee pairs. As we did in the EpiNow2 section, we will use the get_pairwise function from the epicontacts package, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains chapter for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in onset dates between transmission pairs, calculated using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma from the epitrix package for this fitting procedure, as we require a discretised distribution.

## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))

## inspect estimate
serial_interval[c("mu", "sd")]
## $mu
## [1] 11.42459611835102
## 
## $sd
## [1] 7.619218456174097

Projecting incidence

To project future incidence, we still need to provide historical incidence in the form of an incidence object, as well as a sample of plausible Rt values. We will generate these values using the Rt estimates generated by EpiEstim in the previous section (under “Estimating Rt”) and stored in the epiestim_res_emp object. In the code below, we extract the mean and standard deviation estimates of Rt for the last time window of the outbreak (using the tail function to access the last element in a vector), and simulate 1000 values from a gamma distribution using rgamma. You can also provide your own vector of Rt values that you want to use for forward projections.

## create incidence object from dates of onset
inc <- incidence::incidence(linelist$date_onset)

## extract plausible r values from most recent estimate
mean_r <- tail(epiestim_res_emp$R$`Mean(R)`, 1)
sd_r <- tail(epiestim_res_emp$R$`Std(R)`, 1)
shapescale <- gamma_mucv2shapescale(mu = mean_r, cv = sd_r/mean_r)
plausible_r <- rgamma(1000, shape = shapescale$shape, scale = shapescale$scale)

## check distribution
qplot(x = plausible_r, geom = "histogram", xlab = expression(R[t]), ylab = "Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We then use the project() function to make the actual forecast. We specify how many days we want to project for via the n_days argument, and the number of simulations via the n_sim argument.

## make projection
proj <- project(
  x = inc,
  R = plausible_r,
  si = serial_interval$distribution,
  n_days = 21,
  n_sim = 1000
)

We can then handily plot the incidence and projections using the plot() and add_projections() functions. We can easily subset the incidence object to only show the most recent cases by using the square bracket operator.

## plot incidence and projections
plot(inc[inc$dates > as.Date("2015-03-01")]) %>%
  add_projections(proj)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

You can also easily extract the raw estimates of daily case numbers by converting the output to a dataframe.

## convert to data frame for raw data
proj_df <- as.data.frame(proj)
proj_df

Resources

Survey analysis

THIS PAGE IS UNDER CONSTRUCTION

Overview

Preparation

Weighting

Random selection

Resources

Survival analysis

Overview

Survival analysis focuses on describing, for a given individual or group of individuals, a defined event called the failure (occurrence of a disease, cure from a disease, death, relapse after response to treatment…) that occurs after a period of time called the failure time (or follow-up time in cohort/population-based studies) during which individuals are observed. To determine the failure time, it is necessary to define a time of origin (which can be the inclusion date, the date of diagnosis…).

The target of inference for survival analysis is then the time between an origin and an event. In current medical research, it is widely used in clinical studies to assess the effect of a treatment for instance, or in cancer epidemiology to assess a large variety of cancer survival measures.

It is usually expressed through the survival probability, which is the probability that the event of interest has not occurred by a given time t.
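Formally, if T denotes the time from the origin to the event, the survival function is

```latex
S(t) = \Pr(T > t) = 1 - F(t)
```

where F(t) is the cumulative distribution function of the failure time.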

Censoring: Censoring occurs when, at the end of follow-up, some individuals have not had the event of interest, so their true time to event is unknown. We will mostly focus on right censoring here; for more details on censoring and survival analysis in general, see the Resources section.
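As a minimal sketch with invented values, right-censored observations are typically encoded as a pair of variables: an observed time and an event indicator (1 if the event occurred, 0 if censored):

```r
# toy follow-up data (invented values): time in days, status 1 = died, 0 = right-censored
time   <- c(6, 5, 14, 21, 11)
status <- c(0, 0, 0, 1, 1)

# for censored individuals the true time-to-event is only known to exceed 'time';
# printing a "+" after censored times mimics the convention used by Surv() later on
paste0(time, ifelse(status == 0, "+", ""))
# "6+" "5+" "14+" "21" "11"
```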

Preparation

To run survival analyses in R, one of the most widely used packages is the survival package. We first install it and then load it, along with the other packages that will be used in this section:
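A setup sketch, assuming the packages used in this section's examples (pacman installs any that are missing before loading):

```r
pacman::p_load(
  survival,   # core survival analysis functions (Surv, survfit, coxph, ...)
  survminer,  # ggsurvplot(), ggcoxzph(), ggforest()
  Epi,        # stat.table() cross-tabulations
  rio,        # data import
  tidyverse   # data management and plotting
  )
```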

This page explores survival analyses using the linelist used in most of the previous pages, to which we apply some changes to obtain proper survival data.

Loading dataset

We start by loading the linelist as you have done previously using the rio::import() function.

# import linelist
linelist_case_data <- rio::import("linelist_cleaned.xlsx")

Data management and transformation

In short, survival data can be described as having the following three characteristics:

  1. the dependent variable or response is the waiting time until the occurrence of a well-defined event,
  2. observations are censored, in the sense that for some units the event of interest has not occurred at the time the data are analyzed, and
  3. there are predictors or explanatory variables whose effect on the waiting time we wish to assess or control.

Thus, we will create different variables needed to respect that structure and run the survival analysis.

We define:

  • our event of interest as being “death” (hence our survival probability will be the probability of being alive after a certain time after the time of origin),
  • the follow-up time (futime) as the time between the time of onset and the time of outcome in days,
  • censored patients as those who recovered or for whom the final outcome is not known ie the event “death” was not observed (event=0).

CAUTION: In a real cohort study, the time of origin and the end of follow-up are known because individuals are observed throughout; here we will therefore remove observations where the date of onset or the date of outcome is unknown. Cases where the date of onset is later than the date of outcome will also be removed, since they are considered erroneous.

TIP: Filtering to greater than (>) or less than (<) a date removes rows with missing values, so applying the filter on the wrong dates will also remove the rows with missing dates.
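A toy illustration of this behavior (data invented for demonstration):

```r
library(dplyr)

# toy data (invented): a valid date pair, a reversed pair, and a missing outcome date
toy <- data.frame(
  date_onset   = as.Date(c("2014-05-01", "2014-05-10", "2014-05-20")),
  date_outcome = as.Date(c("2014-05-08", "2014-05-05", NA))
)

# keeping rows where the outcome is after the onset drops the reversed pair
# AND the row with the missing date, because an NA comparison is not TRUE
toy_kept <- toy %>% filter(date_outcome > date_onset)

nrow(toy_kept)  # 1
```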

We then create, from the variable age_cat, another variable age_cat_small that reduces the age groups to 3 categories.

#create a new data called linelist_surv from the linelist_case_data

linelist_surv <-  linelist_case_data %>% 
     
  dplyr::filter(
       # remove observations with wrong or missing dates of onset or date of outcome
       date_outcome > date_onset) %>% 
  
  dplyr::mutate(
       # create the event var which is 1 if the patient died and 0 if they were right-censored
       event = ifelse(is.na(outcome) | outcome == "Recover", 0, 1), 
    
       # create the var on the follow-up time in days
       futime = as.double(date_outcome - date_onset), 
    
       # create a new age category variable with only 3 strata levels
       age_cat_small = dplyr::case_when( 
            age_years < 5  ~ "0-4",
            age_years >= 5 & age_years < 20 ~ "5-19",
            age_years >= 20   ~ "20+"),
       
       # previous step created age_cat_small var as character.
       # now convert it to factor and specify the levels.
       # Note that the NA values remain NA's and are not put in a level "unknown" for example,
       # since in the next analyses they have to be removed.
       age_cat_small = factor(age_cat_small,
                              levels = c("0-4", "5-19", "20+")))

TIP: We can verify the new variables we have created by running a summary on futime and a cross-tabulation between event and the outcome variable from which it was created. Beyond this verification, it is good practice to report the median follow-up time when interpreting survival analysis results.

summary(linelist_surv$futime)
##          Min.       1st Qu.        Median          Mean       3rd Qu.          Max. 
##  1.0000000000  6.0000000000 10.0000000000 11.9234006734 16.0000000000 64.0000000000
# cross tabulate the new event var and the outcome var from which it was created
# to make sure the code did what it was intended to
with(linelist_surv, 
     table(outcome, event, useNA = "ifany")
     )
##          event
## outcome      0    1
##   Death      0 2060
##   Recover 1617    0
##   <NA>    1075    0
# cross tabulate the new age_cat_small var and the age_cat var from which it was created,
# to make sure the code did what it was intended to

with(linelist_surv, 
     table(age_cat_small, age_cat, useNA = "ifany")
     ) 
##              age_cat
## age_cat_small 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ <NA>
##          0-4  851   0     0     0     0     0     0   0    0
##          5-19   0 899   733   619     0     0     0   0    0
##          20+    0   0     0     0   904   583    85  10    0
##          <NA>   0   0     0     0     0     0     0   0   68
# print the 10 first observations of the linelist_surv data looking at specific variables (including those newly created)

head(linelist_surv[,c("case_id", "age_cat_small", "date_onset","date_outcome","outcome","event","futime")], 10)
##    case_id age_cat_small date_onset date_outcome outcome event futime
## 1   a3c8b8           0-4 2014-05-08   2014-05-14 Recover     0      6
## 2   8689b7           0-4 2014-05-13   2014-05-18 Recover     0      5
## 3   11f8ea           20+ 2014-05-16   2014-05-30 Recover     0     14
## 4   893f25           20+ 2014-05-21   2014-05-29 Recover     0      8
## 5   be99c8          5-19 2014-05-22   2014-05-24 Recover     0      2
## 6   d0523a           0-4 2014-05-24   2014-06-05    <NA>     0     12
## 7   ce9c02           20+ 2014-05-27   2014-06-17   Death     1     21
## 8   275cc7           0-4 2014-05-27   2014-06-07   Death     1     11
## 9   07e3e8           20+ 2014-05-27   2014-06-01 Recover     0      5
## 10  2b8773          5-19 2014-06-06   2014-06-16    <NA>     0     10

We can also cross-tabulate the variables age_cat_small and gender to have more details on the distribution of this new variable across the gender groups. For this we use the stat.table() function of the Epi package.

Epi::stat.table( 
  #give variables for the cross tabulation
  list(
    gender, 
    age_cat_small
    ),
  
  #specify the function(s) you want to call (count, percent...)
  list( 
    count(),
    percent(age_cat_small)
    ), 
  
  #add margins
  margins=T, 
  
  #data used
  data = linelist_surv 
  )
##  ----------------------------------------- 
##          ----------age_cat_small---------- 
##  gender       0-4    5-19     20+   Total  
##  ----------------------------------------- 
##  f            498    1273     500    2271  
##              21.9    56.1    22.0   100.0  
##                                            
##  m            322     903    1029    2254  
##              14.3    40.1    45.7   100.0  
##                                            
##                                            
##  Total        851    2251    1582    4752  
##              18.2    48.1    33.8   100.0  
##  -----------------------------------------

Basics of survival analysis

Building a surv-type object

We will first use Surv() to build a standard survival object from the follow-up time and event variables. The result is an object of type Surv that summarizes the time information and indicates whether or not the event of interest (death) was observed. In the printout of survobj, a "+" after a time indicates right-censoring.

survobj <- with(linelist_surv, 
                
                survival::Surv(futime, event)
                
                )

#print the 50 first elements of the vector to see how it presents
head(survobj,50)
##  [1]  6+  5+ 14+  8+  2+ 12+ 21  11   5+ 10+  4  10+  3  10+ 11+  4   4  11  23+  4   9+ 14+  5+  9+  5  31+  8  11+ 13+  4  26+  7 
## [33]  6  14+  4  15  17+  8+ 18  12  14  12+  7   6+ 11  20+  2  22+  5   6+

Running initial analyses

We then start our analysis using the survfit() function to produce a survfit object, which fits the default Kaplan-Meier (KM) estimates of the overall (marginal) survival curve, in fact a step function with jumps at the observed event times. The final survfit object contains one or more survival curves and is created using the Surv object as the response variable in the model formula.

NOTE: The Kaplan-Meier estimate is a nonparametric maximum likelihood estimate (MLE) of the survival function (see Resources for more information).

The summary of this survfit object gives what is called a life table, which contains:

  • each follow-up time (time) at which an event happened, in ascending order,
  • the number of people who were still at risk of developing the event (those who had neither had the event yet nor been censored: n.risk),
  • the number who did develop the event (n.event),
  • and, from these, the probability of not developing the event (the probability of not dying, i.e. of surviving past that specific time).
  • Finally, the standard error and the confidence interval for that probability are derived.
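These columns implement the Kaplan-Meier estimator, which multiplies, over the ordered event times up to t, the conditional probabilities of surviving each event time:

```latex
\hat{S}(t) = \prod_{i:\; t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

where n_i is n.risk and d_i is n.event at time t_i.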

#fit the KM estimates using a formula where the Surv object "survobj" is the response variable. "~ 1" means we run the model for the overall survival.

linelistsurv_fit <-  survival::survfit(
  survobj ~ 1
  )

#print its summary for more details
summary(linelistsurv_fit)
## Call: survfit(formula = survobj ~ 1)
## 
##  time n.risk n.event       survival          std.err   lower 95% CI   upper 95% CI
##     1   4752      30 0.993686868687 0.00114897074477 0.991437477210 0.995941363624
##     2   4713      76 0.977663061766 0.00214519789270 0.973467579110 0.981876626250
##     3   4596     156 0.944478675857 0.00333379261989 0.937967112583 0.951035443761
##     4   4365     209 0.899256214630 0.00440361825904 0.890666568481 0.907928699885
##     5   4069     222 0.850193821008 0.00525213128927 0.839961900928 0.860550380300
##     6   3744     217 0.800917095806 0.00591787354418 0.789401859425 0.812600308315
##     7   3362     194 0.754701177725 0.00643977231029 0.742184413673 0.767429034034
##     8   3019     178 0.710204056283 0.00686966758582 0.696866582986 0.723796798233
##     9   2694     152 0.670133151845 0.00721005347215 0.656149663429 0.684414648413
##    10   2402     113 0.638607320805 0.00745589465065 0.624159966029 0.653389086743
##    11   2161     128 0.600781436000 0.00772761063757 0.585824918228 0.616119804077
##    12   1911      96 0.570600892904 0.00792959526282 0.555268921644 0.586356207402
##    13   1666      60 0.550051040819 0.00807562808397 0.534448659604 0.566108908816
##    14   1497      44 0.533883875959 0.00819781923486 0.518055815781 0.550195527829
##    15   1342      32 0.521153410958 0.00830549455351 0.505126547084 0.537688781794
##    16   1193      48 0.500184958547 0.00850490443984 0.483790355114 0.517135139451
##    17   1038      29 0.486210619628 0.00865412288334 0.469541300918 0.503471720544
##    18    933      23 0.474224720109 0.00879426103942 0.457297766857 0.491778227365
##    19    828       8 0.469642838756 0.00885723562479 0.452599903667 0.487327536324
##    20    731       4 0.467072973701 0.00890148483539 0.449948206037 0.484849500086
##    21    649      13 0.457717120607 0.00909352744482 0.440236677304 0.475891658503
##    22    569       8 0.451281730510 0.00924593527072 0.433519056759 0.469772198286
##    23    502       6 0.445887924966 0.00939398273662 0.427851015289 0.464685216409
##    24    455       4 0.441968031120 0.00951366171674 0.423709465382 0.461013398311
##    25    397       4 0.437514952721 0.00967484370524 0.418957658367 0.456894223155
##    26    354       3 0.433807198885 0.00982682467983 0.414968277240 0.453501378601
##    27    312       1 0.432416791196 0.00989320456989 0.413454788483 0.452248435661
##    29    244       1 0.430644591232 0.01001013004335 0.411465303272 0.450717867297
##    38     75       1 0.424902663349 0.01140519989053 0.403126712063 0.447854900007

When using summary() we can add the times argument to specify the times at which we want to see the survival information.

#print its summary at specific times
summary(
  linelistsurv_fit,
        times=c(5,10,20,30,60)
        )
## Call: survfit(formula = survobj ~ 1)
## 
##  time n.risk n.event       survival          std.err   lower 95% CI   upper 95% CI
##     5   4069     693 0.850193821008 0.00525213128927 0.839961900928 0.860550380300
##    10   2402     854 0.638607320805 0.00745589465065 0.624159966029 0.653389086743
##    20    731     472 0.467072973701 0.00890148483539 0.449948206037 0.484849500086
##    30    219      40 0.430644591232 0.01001013004335 0.411465303272 0.450717867297
##    60      2       1 0.424902663349 0.01140519989053 0.403126712063 0.447854900007

We can also use the print() function. The print.rmean=TRUE argument is used to obtain the mean survival time and its standard error (se).

NOTE: The restricted mean survival time (RMST) is a survival measure increasingly used in cancer survival analysis, often defined as the area under the survival curve when patients are observed up to a restriction time T (more details in Resources).
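Formally, the RMST up to a restriction time T is the area under the survival curve:

```latex
\text{RMST}(T) = \int_0^{T} S(t)\, dt
```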

#print the linelistsurv_fit object and ask for information on the mean survival time and its se. 
print(
  linelistsurv_fit, 
      print.rmean = TRUE
      )
## Call: survfit(formula = survobj ~ 1)
## 
##                 n            events            *rmean        *se(rmean)            median           0.95LCL           0.95UCL 
## 4752.000000000000 2060.000000000000   32.869929321153    0.524745149793   17.000000000000   16.000000000000   18.000000000000 
##     * restricted mean with upper limit =  64

TIP: We can create the Surv object directly in the survfit() function and save a line of code: linelistsurv_quick <- survfit(Surv(futime, event) ~ 1, data = linelist_surv). In that case, however, we must specify the data argument so the futime and event variables can be found.

Besides the summary() function, we can also use the str() function, which gives more details on the structure of the survfit() object. One important element is cumhaz, which can be used for instance to plot the cumulative hazard, the hazard being the instantaneous rate of event occurrence (see Resources).
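Here cumhaz holds the Nelson-Aalen estimate of the cumulative hazard, which sums the ratio of events to numbers at risk over the observed event times (with d_i = n.event and n_i = n.risk as before); note that exp(−H(t)) closely tracks the KM survival estimate:

```latex
\hat{H}(t) = \sum_{i:\; t_i \le t} \frac{d_i}{n_i}
```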

print(
  str(linelistsurv_fit)
      )
## List of 16
##  $ n        : int 4752
##  $ time     : num [1:60] 1 2 3 4 5 6 7 8 9 10 ...
##  $ n.risk   : num [1:60] 4752 4713 4596 4365 4069 ...
##  $ n.event  : num [1:60] 30 76 156 209 222 217 194 178 152 113 ...
##  $ n.censor : num [1:60] 9 41 75 87 103 165 149 147 140 128 ...
##  $ surv     : num [1:60] 0.994 0.978 0.944 0.899 0.85 ...
##  $ std.err  : num [1:60] 0.00116 0.00219 0.00353 0.0049 0.00618 ...
##  $ cumhaz   : num [1:60] 0.00631 0.02244 0.05638 0.10426 0.15882 ...
##  $ std.chaz : num [1:60] 0.00115 0.00218 0.00348 0.00481 0.00604 ...
##  $ type     : chr "right"
##  $ logse    : logi TRUE
##  $ conf.int : num 0.95
##  $ conf.type: chr "log"
##  $ lower    : num [1:60] 0.991 0.973 0.938 0.891 0.84 ...
##  $ upper    : num [1:60] 0.996 0.982 0.951 0.908 0.861 ...
##  $ call     : language survfit(formula = survobj ~ 1)
##  - attr(*, "class")= chr "survfit"
## NULL

Plotting Kaplan-Meier curves

Once the KM estimates are fitted, we can visualize the probability of being alive over time using the basic plot() function, which draws the well-known "Kaplan-Meier curve". In other words, the curve below is a conventional illustration of the survival experience in the whole patient group.

We can easily verify the minimum and maximum follow-up times on the curve.

An easy way to interpret it is to say that at time zero, all the participants are still alive: the survival probability is then 100%. It then decreases over time as patients die. The proportion of participants surviving past 60 days of follow-up is around 40%.

plot(linelistsurv_fit, 
     xlab = "Days of follow-up",    #xaxis label
     ylab="Survival Probability",   #yaxis label
     main= "Overall survival curve" #figure title
     )

The confidence intervals of the KM survival estimates are also plotted by default and can be removed by adding the option conf.int=FALSE to the plot() command.

Since the event of interest is "death", drawing a curve of the complement of the survival proportions gives the cumulative mortality proportions.

plot(
     linelistsurv_fit,
     xlab = "Days of follow-up",       
     ylab="Survival Probability",       
     mark.time=TRUE,              #mark times of events to facilitate reading of the curve: a "+" sign is printed on the curve at every event
     conf.int=FALSE,             #do not plot the confidence interval
     main= "Overall survival curve and cumulative mortality"
     )



#draw an additional curve to the previous plot
lines( 
      linelistsurv_fit, 
      lty=3,          #use a different line type to differentiate between the two curves and for legend clarity purposes
      fun = "event", #draw the cumulative events instead of the survival 
      mark.time=FALSE, 
      conf.int=FALSE 
      )

#add a legend to the plot
legend("topright", #position of the legend in the plot
       legend=c("Survival","Cum. Mortality"), #legend text 
       lty = c(1,3), #line types to use in the legend, should follow linetype used to draw the two curves
       cex=.85, #factor that defines size of the legend text
       bty = "n" #no box type to be drawn for the legend
       )

Comparison of survival curves

To compare survival across different groups of our observed participants or patients, we first look at their respective survival curves and then run tests to evaluate the difference between independent groups. This comparison can concern groups based on gender, age, treatment, comorbidity…

Log rank test

The log rank test is a popular test that compares the entire survival experience between two or more independent groups. It can be thought of as a test of whether the survival curves are identical (overlapping) or not (null hypothesis of no difference in survival between the groups). The survdiff() function of the survival package runs the log-rank test when we specify rho=0 (the default). The test gives a chi-square statistic along with a p-value, since the log rank statistic is approximately chi-square distributed.

We first compare the survival curves by gender group. For this, we start by visualizing them (checking whether the two survival curves overlap). A new survfit object will be created with a slightly different formula. Then the survdiff object will be created.

#create the new survfit object based on gender
linelistsurv_fit_sex <-  survfit(
  
              Surv(futime, event) ~ gender, #formula to create the survival curve: ~ gender indicates we no longer plot the overall survival but based on gender
              data = linelist_surv #data to use 
              )


#plot the survival curves by gender: have a look at the order of the strata level in the gender var before defining your colors
col_sex <- c("lightgreen", "darkgreen")

plot(linelistsurv_fit_sex,
     col=col_sex,
     xlab = "Days of follow-up", 
     ylab="Survival Probability"
     )

legend("topright", 
       legend=c("Female","Male"), 
       col =col_sex,
       lty = 1, cex=.9, bty = "n" 
       )

#compute the test of the difference between the survival curves
survival::survdiff(
          Surv(futime, event) ~ gender, 
          data = linelist_surv
         )
## Call:
## survival::survdiff(formula = Surv(futime, event) ~ gender, data = linelist_surv)
## 
## n=4525, 227 observations deleted due to missingness.
## 
##             N Observed       Expected      (O-E)^2/E      (O-E)^2/V
## gender=f 2271      997  976.621043836 0.425243605959 0.882404549082
## gender=m 2254      980 1000.378956164 0.415144532756 0.882404549082
## 
##  Chisq= 0.9  on 1 degrees of freedom, p= 0.3475439376

We see that the survival curve for women and the one for men overlap up to 15 days of follow-up, after which women seem to have a slightly better survival. Yet the log-rank test does not give enough evidence of a statistical difference between the survival of women and that of men at \alpha= 0.05.

Some packages allow illustrating survival curves for different groups and testing the difference at once. Using the ggsurvplot() function from the survminer package, we can add to our curve the risk table for each group, as well as the p-value from the log-rank test.

We recover the p-value found in the previous step.

CAUTION: Since their latest versions, survminer functions require specifying again the data used to fit the survival object. Remember to do this to avoid non-specific error messages.

survminer::ggsurvplot(
  
    linelistsurv_fit_sex, 
    data= linelist_surv, #specify again the data used to fit linelistsurv_fit_sex even though it is already stored in that object
    conf.int = F, #do not show confidence interval of KM estimates
    surv.scale = "percent",  #present probabilities in the y axis in %
    break.time.by=10, #present the time axis with an increment of 10 days
    xlab = "Follow-up days", ylab= "Survival Probability",
    pval=T, pval.coord= c(40,.91),  #print p-value of Log-rank test and at the position with these coordinates
    risk.table=T,  #print the risk table 
    legend.title = "Gender",
    legend.labs = c("Female","Male"), font.legend = 10, #legend characteristics
    palette = "Dark2", #name of an existing palette,
    surv.median.line = "hv", #draw a line to the median survival
    ggtheme = theme_light()
)

We can then look for a difference in survival by source of infection. In this case, the log rank test gives enough evidence of a difference in the survival probabilities at \alpha= 0.005. The survival probabilities for patients infected at funerals are higher than those for patients infected elsewhere, suggesting a survival benefit.

linelistsurv_fit_source <-  survfit(
              Surv(futime, event) ~ source,
              data = linelist_surv
              )

ggsurvplot( 
      linelistsurv_fit_source, data= linelist_surv,
      size=1, linetype = "strata",
      conf.int = T, 
      surv.scale = "percent",  
      break.time.by=10, 
      xlab = "Follow-up days", ylab= "Survival Probability",
      pval=T, pval.coord= c(40,.91),  
      risk.table=T,
      legend.title = "Source of \ninfection", legend.labs = c("Funeral","Other"), 
      font.legend = 10,
      palette = c("#E7B800","#3E606F"),
      surv.median.line = "hv", 
      ggtheme = theme_light()
)
## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.

Cox regression analysis

Cox proportional hazards regression is one of the most popular regression techniques for survival analysis. Other models can also be used, since the Cox model relies on important assumptions that must be verified for appropriate use, such as the proportional hazards assumption (see Resources).

In a Cox proportional hazards regression model, the measure of effect is the hazard ratio (HR), where the hazard is the risk of failure (the risk of death in our example) given that the participant has survived up to a specific time. Usually, we are interested in comparing independent groups with respect to their hazards, and we use the hazard ratio, which is analogous to an odds ratio in the setting of multiple logistic regression analysis. The coxph() function from the survival package is used to fit the model. The function cox.zph(), also from the survival package, may be used to test the proportional hazards assumption for a Cox regression model fit.

NOTE: A probability must lie in the range 0 to 1. However, the hazard represents the expected number of events per one unit of time.

  • If the hazard ratio for a predictor is close to 1 then that predictor does not affect survival,
  • if the HR is less than 1, then the predictor is protective (i.e., associated with improved survival),
  • and if the HR is greater than 1, then the predictor is associated with increased risk (or decreased survival).
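As a small numeric sketch (coefficient values invented for illustration), the HR and its 95% Wald confidence interval are obtained by exponentiating the log-hazard coefficient and its error bounds:

```r
# hypothetical Cox coefficient (log hazard ratio) and its standard error
coef_est <- -0.114
se_est   <- 0.014

# hazard ratio and 95% Wald confidence interval
hr       <- exp(coef_est)
hr_lower <- exp(coef_est - 1.96 * se_est)
hr_upper <- exp(coef_est + 1.96 * se_est)

round(c(HR = hr, lower = hr_lower, upper = hr_upper), 3)
# HR ~ 0.89: the hazard is multiplied by ~0.89 per one-unit increase in the predictor
```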

Fitting a Cox model

We can first fit a model to assess the effect of age and gender on the survival. By just printing the model, we have the information on:

  • the estimated regression coefficients (coef) which quantifies the association between the predictors and the outcome,
  • their exponential (for interpretability, exp(coef)) which produces the hazard ratio,
  • their standard error (se(coef)),
  • the z-score: how many standard errors is the estimated coefficient away from 0,
  • and the p-value: the probability that the estimated coefficient could be 0.

The summary() function applied to the cox model object gives more info such as the confidence interval of the estimated HR and the different test scores.

The effect of the first covariate, gender, is presented in the first row. genderm is printed, indicating that the first stratum level ("f"), i.e. the female group, is the reference group for gender. Thus the test parameter is interpreted as men compared to women. The p-value indicates there was not enough evidence of an effect of gender on the expected hazard, or of an association between gender and all-cause mortality.

The same lack of evidence is noted for the age group.

#fitting the cox model
linelistsurv_cox_sexage <-  survival::coxph(
              Surv(futime, event) ~ gender + age_cat_small, 
              data = linelist_surv
              )


#printing the model fitted
linelistsurv_cox_sexage
## Call:
## survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small, 
##     data = linelist_surv)
## 
##                                coef         exp(coef)          se(coef)        z       p
## genderm           -0.02993090204100  0.97051259167096  0.04641075941614 -0.64491 0.51898
## age_cat_small5-19 -0.08031961491896  0.92282135177524  0.06103986070195 -1.31586 0.18822
## age_cat_small20+  -0.09947669055336  0.90531105192223  0.06607520075858 -1.50551 0.13219
## 
## Likelihood ratio test=3.26  on 3 df, p=0.353582261511
## n= 4525, number of events= 1977 
##    (227 observations deleted due to missingness)
#summary of the model
summary(linelistsurv_cox_sexage)
## Call:
## survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small, 
##     data = linelist_surv)
## 
##   n= 4525, number of events= 1977 
##    (227 observations deleted due to missingness)
## 
##                                coef         exp(coef)          se(coef)        z Pr(>|z|)
## genderm           -0.02993090204100  0.97051259167096  0.04641075941614 -0.64491  0.51898
## age_cat_small5-19 -0.08031961491896  0.92282135177524  0.06103986070195 -1.31586  0.18822
## age_cat_small20+  -0.09947669055336  0.90531105192223  0.06607520075858 -1.50551  0.13219
## 
##                         exp(coef)     exp(-coef)       lower .95      upper .95
## genderm           0.9705125916710 1.030383334108 0.8861276015030 1.062933474811
## age_cat_small5-19 0.9228213517752 1.083633357720 0.8187671517776 1.040099429299
## age_cat_small20+  0.9053110519222 1.104592722995 0.7953430847214 1.030483720142
## 
## Concordance= 0.511  (se = 0.007 )
## Likelihood ratio test= 3.26  on 3 df,   p=0.353582262
## Wald test            = 3.3  on 3 df,   p=0.347276801
## Score (logrank) test = 3.3  on 3 df,   p=0.346971184

It was interesting to run the model and look at the results, but first verifying whether the proportional hazards assumption is respected could help save time.

test_ph_sexage <- survival::cox.zph(linelistsurv_cox_sexage)
test_ph_sexage
##                        chisq df       p
## gender        0.198373236650  1 0.65604
## age_cat_small 0.224146419564  2 0.89398
## GLOBAL        0.347609634823  3 0.95084

NOTE: A second argument called method can be specified when computing the Cox model. It determines how ties are handled. The default is "efron"; the other options are "breslow" and "exact".

In another model we add more risk factors such as the source of infection and the number of days between date of onset and admission. This time, we first verify the proportional hazards assumption before going forward.

In this model, we have included a continuous predictor (days_onset_hosp). In this case, we interpret the parameter estimate as the increase in the expected log of the relative hazard for each one-unit increase in the predictor, holding other predictors constant. The graphical verification of the proportional hazards assumption may be performed with the function ggcoxzph() from the survminer package.

#fit the model
linelistsurv_cox <-  coxph(
                        Surv(futime, event) ~ gender + age_years+ source + days_onset_hosp,
                        data = linelist_surv
                        )


#test the proportional hazard model
linelistsurv_ph_test <- cox.zph(linelistsurv_cox)
linelistsurv_ph_test
##                           chisq df               p
## gender           0.241995730700  1        0.622768
## age_years        0.360342309537  1        0.548316
## source           2.816651205641  1        0.093291
## days_onset_hosp 34.394956562443  1 0.0000000044989
## GLOBAL          37.989980649635  4 0.0000001125905
survminer::ggcoxzph(linelistsurv_ph_test)

The model results indicate a negative association between the onset-to-admission duration and all-cause mortality. The expected hazard in a person admitted one day later than another is 0.89 times that of the earlier-admitted person, holding gender constant. More plainly, a one-day increase in the onset-to-admission duration is associated with a 10.8% ((1 − HR) × 100) decrease in the hazard of death.

Results also show a positive association between the source of infection and all-cause mortality: the hazard of death is increased (HR 1.22) for patients with a source of infection other than funerals.

#print the summary of the model
summary(linelistsurv_cox)
## Call:
## coxph(formula = Surv(futime, event) ~ gender + age_years + source + 
##     days_onset_hosp, data = linelist_surv)
## 
##   n= 2904, number of events= 1266 
##    (1848 observations deleted due to missingness)
## 
##                               coef          exp(coef)           se(coef)        z              Pr(>|z|)    
## genderm          0.008326822232359  1.008361586641964  0.058254813319130  0.14294              0.886339    
## age_years       -0.005077100640118  0.994935766050966  0.002372587855671 -2.13990              0.032363 *  
## sourceother      0.197068679425003  1.217827677538457  0.082123016069218  2.39968              0.016410 *  
## days_onset_hosp -0.114056693760552  0.892207371857420  0.014294046322273 -7.97931 0.0000000000000014715 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                       exp(coef)      exp(-coef)       lower .95       upper .95
## genderm         1.0083615866420 0.9917077497271 0.8995591096993 1.1303238202490
## age_years       0.9949357660510 1.0050900109553 0.9903198696454 0.9995731772219
## sourceother     1.2178276775385 0.8211342363488 1.0367704298610 1.4305040049971
## days_onset_hosp 0.8922073718574 1.1208156663379 0.8675583477701 0.9175567227758
## 
## Concordance= 0.573  (se = 0.008 )
## Likelihood ratio test= 89.67  on 4 df,   p=< 0.0000000000000002220446
## Wald test            = 73.91  on 4 df,   p=0.00000000000000338815131
## Score (logrank) test = 74.35  on 4 df,   p=0.00000000000000273116521

Forest plots

We can then visualize the results of the Cox model as a forest plot with the ggforest() function from the survminer package.

ggforest(linelistsurv_cox, data = linelist_surv)

GIS basics

Overview

Spatial aspects of your data can provide many insights into an outbreak, helping to answer questions such as:

  • Where are the current disease hotspots?
  • How have the hotspots changed over time?
  • How accessible are health facilities? Are any improvements needed?

In this section, we will explore basic spatial data visualization methods using the tmap and ggplot2 packages. We will also walk through some basic spatial data management and querying methods with the sf package.

Here are some example outputs:

Choropleth map

Case density heatmap

Health facility catchment areas

Preparation

Load packages

First, load the packages required for this analysis:

pacman::p_load(
  rio,          # to import data
  here,         # to locate files
  tidyverse,    # to clean, handle, and plot the data (includes ggplot2 package)
  sf,           # to manage spatial data using a Simple Feature format
  tmap,         # to produce simple maps, works for both interactive and static maps
  janitor,      # to clean column names
  OpenStreetMap # to add OSM basemap in ggplot map
  ) 

Sample case data

For demonstration purposes, we will work with a random sample of 1000 cases from the linelist dataframe (fewer cases are easier to display in this handbook).

First we import the dataframe using import() (see page on Import and export). It could be a .csv file, an .rds R data file, or in this case an Excel spreadsheet (.xlsx).

# import clean case linelist
linelist <- import("linelist_cleaned.xlsx")  

Next we select a random sample of 1000 rows using sample() from base R.

# generate 1000 random row numbers, from the number of rows in linelist
sample_rows <- sample(nrow(linelist), 1000)

# subset linelist to keep only the sample rows, and all columns
linelist <- linelist[sample_rows,]

Now we want to convert this linelist, which is of class dataframe, to an object of class “sf” (spatial features). Given that the linelist has two columns, “lon” and “lat”, representing the longitude and latitude of each case’s residence, this will be easy.

We use the package sf (spatial features) and its function st_as_sf() to create the new object, which we call linelist_sf. This new object looks essentially the same as the linelist, but the columns lon and lat have been designated as coordinate columns, and a coordinate reference system (CRS) has been assigned for when the points are displayed.

# Create sf object
linelist_sf <- linelist %>%
     sf::st_as_sf(coords = c("lon", "lat"), crs = 4326)

Admin boundary shapefiles

Sierra Leone: Admin boundary shapefiles

In advance, we have downloaded all administrative boundaries for Sierra Leone from the Humanitarian Data Exchange (HDX) website here.

Now we are going to do the following to save the Admin Level 3 shapefile in R:

  1. Import the shapefile
  2. Clean the column names
  3. Filter rows to keep only areas of interest

To import a shapefile we use the read_sf() function from sf. We provide the filepath via here() - in this case the file is within our R project in the “data” and “shp” subfolders, with filename “sle_adm3.shp” (see pages on Import and export and R projects for more information).

sle_adm3_raw <- sf::read_sf(here::here("data", "shp", "sle_adm3.shp"))

Next we use clean_names() from the janitor package to standardize the column names of the shapefile. We also use filter() to keep only the rows with admin2name of “Western Area Urban” or “Western Area Rural”.

# ADM3 level clean
sle_adm3 <- sle_adm3_raw %>%
  janitor::clean_names() %>% # standardize column names
  filter(admin2name %in% c("Western Area Urban", "Western Area Rural")) # filter to keep certain areas

Below you can see how the shapefile looks after import and cleaning. Scroll to the right to see the columns for admin level 0 (country), admin level 1, admin level 2, and finally admin level 3. Each level has a character name and a unique identifier “pcode”. The pcode expands with each increasing admin level, e.g. SL (Sierra Leone) -> SL04 (Western) -> SL0401 (Western Area Rural) -> SL040101 (Koya Rural).
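Because pcodes nest in this way, a lower-level pcode begins with the pcode of each of its parent units, so simple base R string functions can check the hierarchy or derive a parent pcode. A minimal sketch, using the example pcode from the text:

```r
adm3_pcode <- "SL040101"          # Koya Rural (admin level 3)

# a child pcode starts with each of its ancestors' pcodes
startsWith(adm3_pcode, "SL")      # country prefix:       TRUE
startsWith(adm3_pcode, "SL04")    # admin level 1 prefix: TRUE

# derive the admin level 1 pcode from an admin level 3 pcode
substr(adm3_pcode, 1, 4)          # "SL04"
```

This kind of prefix logic can be handy for aggregating or validating jurisdictions when only the lowest-level pcode is recorded.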

Population data

Sierra Leone: Population by ADM3

Again, we import data that we have downloaded from HDX (link here). This time we use import() to load the .csv file. We also pass the imported file to clean_names() to standardize the column names.

# Population by ADM3
sle_adm3_pop <- rio::import(here::here("data/population", "sle_admpop_adm3_2020.csv")) %>%
  janitor::clean_names()

Here is what the population file looks like. Scroll to the right to see how each jurisdiction has columns with male population, female population, total population, and the population breakdown in columns by age group.

Health Facilities

Sierra Leone: Health facility data from OpenStreetMap

Again we have downloaded the locations of health facilities from HDX here.

We import their shapefile with read_sf(), again clean the column names, and then filter to keep only the points tagged as either “hospital”, “clinic”, or “doctors”.

# OSM health facility shapefile
sle_hf <- sf::read_sf(here::here("data/shp", "sle_hf.shp")) %>% 
  janitor::clean_names() %>%
  filter(amenity %in% c("hospital", "clinic", "doctors"))

Here is the resulting dataframe - scroll right to see the facility name and coordinates.

Plotting coordinates

The easiest way to plot X-Y coordinates (longitude/latitude, points) is to draw them as points directly from the linelist_sf object which we created in the preparation section.

The package tmap offers simple mapping capabilities for both static (“plot” mode) and interactive (“view” mode) maps with just a few lines of code. The tmap syntax is similar to that of ggplot2, in that commands are added to each other with +. Read more detail in this vignette.

  1. Set the tmap mode. In this case we will use “plot” mode, which produces static outputs.
tmap_mode("plot") # choose either "view" or "plot"

Below, the points are plotted alone. tm_shape() is provided with the linelist_sf object. We then add points via tm_dots(), specifying the size and color. Because linelist_sf is an sf object, we have already designated the two columns that contain the lat/long coordinates and the coordinate reference system (CRS):

# Just the cases (points)
tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')

Alone, the points do not tell us much. So we should also map the administrative boundaries:

Again we use tm_shape() (see documentation) but instead of providing the case points shapefile, we provide the administrative boundary shapefile (polygons).

With the bbox = argument (bbox stands for “bounding box”) we can specify the coordinate boundaries. First we show the map display without bbox, and then with it.

# Just the administrative boundaries (polygons)
tm_shape(sle_adm3) +               # admin boundaries shapefile
  tm_polygons(col = "#F7F7F7") +   # show polygons in light grey
  tm_borders(col = "#000000",      # show borders with color and line weight
             lwd = 2) +
  tm_text("admin3name")            # column text to display for each polygon


# Same as above, but with zoom from bounding box
tm_shape(sle_adm3,
         bbox = c(-13.3, 8.43,    # corner
                  -13.2, 8.5)) +  # corner
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")

And now both points and polygons together:

# All together
tm_shape(sle_adm3, bbox = c(-13.3, 8.43, -13.2, 8.5)) +
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")+
tm_shape(linelist_sf) +
  tm_dots(size=0.08, col='blue') 

To read a good comparison of mapping options in R, see this blog post.

Spatial joins

Points in polygon

Spatially assign administrative units to cases

The case linelist does not contain any information about the administrative units of the cases. Although it is ideal to collect such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).

The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types in this reference.

Below, we will spatially intersect our case locations (points) with the ADM3 boundaries (polygons):

  1. Begin with the linelist (points)
  2. Spatial join to the boundaries, setting the join type to “st_intersects”
  3. Use select() to keep only certain of the new administrative boundary columns
linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3,   join = st_intersects)

All the columns from sle_adm3 have been added to the linelist! Each case now has columns detailing its administrative units. For this example, we only want to keep two of the new columns, so we select() the old column names and just the two additional columns of interest:

linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3, join = st_intersects) %>% 
  
  # Keep the old column names and two new admin ones of interest
  select(names(linelist_sf), admin3name, admin3pcod)

Below, just for display purposes, you can see the first ten cases and their admin level 3 (ADM3) jurisdictions, which have been attached based on where each point spatially intersected the polygon shapes.

# Now you will see the ADM3 names attached to each case
linelist_adm %>% select(case_id, admin3name, admin3pcod)
## Simple feature collection with 1000 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -13.2711696807908 ymin: 8.44837585869944 xmax: -13.2054524923315 ymax: 8.49044210260147
## geographic CRS: WGS 84
## First 10 features:
##      case_id     admin3name admin3pcod                       geometry
## 4151  c98936       East III   SL040205 POINT (-13.2070840607282 8....
## 4033  98b277        West II   SL040207 POINT (-13.2232464328181 8....
## 2076  7d0cf6 Mountain Rural   SL040102 POINT (-13.2220128069607 8....
## 2399  db5dfa Mountain Rural   SL040102 POINT (-13.2239982636844 8....
## 4742  eeadcf     Central II   SL040202 POINT (-13.2387006452846 8....
## 914   054225        East II   SL040204 POINT (-13.2177366791379 8....
## 2499  62e038        East II   SL040204 POINT (-13.226206330081 8.4...
## 5595  5e585d Mountain Rural   SL040102 POINT (-13.2126391320792 8....
## 3242  a7cc25      Central I   SL040201 POINT (-13.2316789544686 8....
## 1984  5ee95c Mountain Rural   SL040102 POINT (-13.214949861693 8.4...

Now we can describe our cases by administrative unit - something we were not able to do before the spatial join!

# Make new dataframe containing counts of cases by administrative unit
case_adm3 <- linelist_adm %>%          # begin with linelist with new admin cols
  as_tibble() %>%                      # convert to tibble for better display
  group_by(admin3pcod, admin3name) %>% # group by admin unit, both by name and pcode 
  summarise(cases = n()) %>%           # summarize and count rows
  arrange(desc(cases))                     # arrange in descending order

case_adm3
## # A tibble: 10 x 3
## # Groups:   admin3pcod [10]
##    admin3pcod admin3name     cases
##    <chr>      <chr>          <int>
##  1 SL040102   Mountain Rural   304
##  2 SL040208   West III         209
##  3 SL040207   West II          180
##  4 SL040204   East II          102
##  5 SL040201   Central I         66
##  6 SL040203   East I            57
##  7 SL040206   West I            36
##  8 SL040205   East III          25
##  9 SL040202   Central II        18
## 10 <NA>       <NA>               3

We can also create a bar plot of case counts by administrative unit.

In this example, we begin the ggplot() with the linelist_adm, so that we can apply factor functions like fct_infreq() which orders the bars by frequency (see page on Factors for tips).

ggplot(
  data = linelist_adm,                       # begin with linelist containing admin unit info
  aes(x = fct_rev(fct_infreq(admin3name))))+ # x-axis is admin units, ordered by frequency (reversed)
  geom_bar()+                                # create bars, height is number of rows
  coord_flip()+                              # flip X and Y axes for easier reading of adm units
  theme_classic()+                           # simplify background
  labs(                                      # titles and labels
    x = "Admin level 3",
    y = "Number of cases",
    title = "Number of cases, by administrative unit",
    caption = "As determined by a spatial join, from 1000 randomly sampled cases from linelist"
  )

Nearest neighbor

Finding the nearest health facility / catchment area

It might be useful to know where the health facilities are located in relation to the disease hot spots.

We can use the st_nearest_feature join method of the st_join() function (sf package) to find the closest health facility to individual cases.

  1. We begin with the shapefile linelist linelist_sf
  2. We spatially join with sle_hf, which is the locations of health facilities and clinics (points)
# Closest health facility to each case
linelist_sf_hf <- linelist_sf %>%                  # begin with linelist shapefile  
  st_join(sle_hf, join = st_nearest_feature) %>%   # data from nearest clinic joined to case data 
  select(case_id, osm_id, name, amenity)           # keep columns of interest, including id, name, type, and geometry of healthcare facility

We can see below (first 50 rows) that each case now has data on the nearest clinic/hospital.

We can see that “Den Clinic” is the closest health facility for about a third of the cases.

# Count cases by health facility
hf_catchment <- linelist_sf_hf %>%    # begin with linelist including nearest clinic data
  as.data.frame() %>%                 # convert from shapefile to dataframe
  group_by(name) %>%                  # group by name of clinic
  summarise(case_n = n()) %>%         # count number of rows per clinic 
  arrange(desc(case_n))               # arrange in descending order

hf_catchment                          # print to console
## # A tibble: 8 x 2
##   name                                  case_n
##   <chr>                                  <int>
## 1 Shriners Hospitals for Children          360
## 2 Den Clinic                               334
## 3 GINER HALL COMMUNITY HOSPITAL            173
## 4 panasonic                                 48
## 5 Princess Christian Maternity Hospital     32
## 6 ARAB EGYPT CLINIC                         23
## 7 MABELL HEALTH CENTER                      16
## 8 <NA>                                      14
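The share quoted above can be checked directly from the counts in the hf_catchment output. A quick sketch, with the case_n values copied from the table:

```r
# counts copied from the hf_catchment table above
case_n <- c(360, 334, 173, 48, 32, 23, 16, 14)

total_cases <- sum(case_n)                 # 1000 - matches our sample size
pct_den_clinic <- case_n[2] / total_cases * 100   # Den Clinic is the 2nd row

round(pct_den_clinic, 1)   # 33.4 - roughly a third of cases
```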

To visualize the results, we can use tmap - this time in interactive mode for easier viewing.

tmap_mode("view")   # set tmap mode to interactive  

# plot the cases and clinic points 
tm_shape(linelist_sf_hf) +            # plot cases
  tm_dots(size=0.08, col='name') +    # cases colored by closest clinic
tm_shape(sle_hf) +                    # plot clinic facilities  
  tm_dots(size=0.3, col='red') +      # red large dots
  tm_text("name") +                   # overlay with name of facility
tm_view(set.view = c(-13.2284, 8.4699, 13), # adjust zoom (center coords, zoom)
        set.zoom.limits = c(13,14))

Buffers

We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.

Note: For more accurate distance calculations, it is better to re-project your sf object to the respective local map projection system, such as UTM (Earth projected onto a planar surface). In this example, for simplicity we will stick with the World Geodetic System (WGS84) geographic coordinate system (Earth represented on a spherical surface, so the units are decimal degrees). We will use a general conversion of 1 decimal degree = ~111km.

See more information about map projections and coordinate systems at this esri article.
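Using the 1 decimal degree ≈ 111km rule of thumb, we can check what the buffer distance in degrees used below corresponds to. Keep in mind this approximation holds for latitude, and for longitude only near the equator (Sierra Leone lies at about 8°N, where a degree of longitude is slightly shorter):

```r
km_per_degree <- 111     # rough conversion near the equator

# distance in km represented by a 0.02 degree buffer
0.02 * km_per_degree     # ~2.2 km

# degrees needed for an exact 2.5 km buffer
2.5 / km_per_degree      # ~0.0225 degrees
```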

First, create a circular buffer with a radius of ~2.5km around each health facility. This is done with the function st_buffer() from sf. Because the map units are lat/long decimal degrees, that is how the distance “0.02” is interpreted. If your map coordinate system is in meters, the number must be provided in meters.

sle_hf_2k <- sle_hf %>%
  st_buffer(dist=0.02)       # decimal degrees translating to approximately 2.5km 

Below we plot the buffer zones themselves:

tmap_mode("plot")
# buffers
tm_shape(sle_hf_2k) +
  tm_borders(col = "red", lwd = 2)

Second, we intersect these buffers with the cases (points) using st_join() and the join type st_intersects. That is, the data from the buffers are joined to the points that they intersect with.

# Intersect the cases with the buffers
linelist_sf_hf_2k <- linelist_sf_hf %>%
  st_join(sle_hf_2k, join = st_intersects, left = TRUE) %>%
  filter(osm_id.x==osm_id.y | is.na(osm_id.y)) %>%
  select(case_id, osm_id.x, name.x, amenity.x, osm_id.y)

Now we can count the results: 189 out of 1000 cases did not intersect with any buffer (that value is missing), and so live more than a 30 minute walk from the nearest health facility.

linelist_sf_hf_2k %>% 
  filter(is.na(osm_id.y)) %>% # empty column - did not join to any buffer
  nrow()
## [1] 189

We can visualize the results such that cases that did not intersect with any buffer appear in red.

tmap_mode("view")

# cases
tm_shape(linelist_sf_hf) +
  tm_dots(size=0.08, col='name') +
# buffers
tm_shape(sle_hf_2k) +
  tm_borders(col = "red", lwd = 2) +

# cases outside buffers
tm_shape(linelist_sf_hf_2k %>%  filter(is.na(osm_id.y))) +
  tm_dots(size=0.1, col='red') +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))

Other spatial joins

Alternative values for argument join include (from the documentation)

  • st_contains_properly
  • st_contains
  • st_covered_by
  • st_covers
  • st_crosses
  • st_disjoint
  • st_equals_exact
  • st_equals
  • st_is_within_distance
  • st_nearest_feature
  • st_overlaps
  • st_touches
  • st_within

Choropleth maps

Choropleth maps can be useful to visualize your data by pre-defined area, usually administrative unit or health area. In outbreak response this can help to target resource allocation for specific areas with high incidence rates, for example.

Now that we have the administrative unit names assigned to all cases (see section on spatial joins, above), we can start mapping the case counts by area (choropleth maps).

Since we also have population data by ADM3, we can add this information to the case_adm3 table created previously.

We begin with the dataframe created in the previous step case_adm3, which is a summary table of each administrative unit and its number of cases.

  1. The population data sle_adm3_pop are joined using left_join() from dplyr, on the basis of common values across the column admin3pcod in the case_adm3 dataframe and the column adm3_pcode in the sle_adm3_pop dataframe (see page on Joining data)
  2. select() is applied to the new dataframe, to keep only the useful columns - total is the total population
  3. Cases per 10,000 population are calculated as a new column with mutate()
# Add population data and calculate cases per 10K population
case_adm3 <- case_adm3 %>% 
     left_join(sle_adm3_pop,                             # add columns from pop dataset
               by = c("admin3pcod" = "adm3_pcode")) %>%  # join based on common values across these two columns
     select(names(case_adm3), total) %>%                 # keep only important columns, including total population
     mutate(case_10kpop = round(cases/total * 10000, 3)) # make new column with case rate per 10000, rounded to 3 decimals

case_adm3                                                # print to console for viewing
## # A tibble: 10 x 5
## # Groups:   admin3pcod [10]
##    admin3pcod admin3name     cases  total case_10kpop
##    <chr>      <chr>          <int>  <int>       <dbl>
##  1 SL040102   Mountain Rural   304  33993       89.4 
##  2 SL040208   West III         209 210252        9.94
##  3 SL040207   West II          180 145109       12.4 
##  4 SL040204   East II          102  99821       10.2 
##  5 SL040201   Central I         66  69683        9.47
##  6 SL040203   East I            57  68284        8.35
##  7 SL040206   West I            36  60186        5.98
##  8 SL040205   East III          25 500134        0.5 
##  9 SL040202   Central II        18  23874        7.54
## 10 <NA>       <NA>               3     NA       NA
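The join-and-rate logic above can also be sketched in base R, which may help clarify what left_join() and mutate() are doing under the hood. The toy tables below use the Mountain Rural values copied from the output above, with column names mimicking the real data:

```r
# toy versions of the two tables (values from the Mountain Rural row above)
case_adm3_toy <- data.frame(admin3pcod = "SL040102", cases = 304)
pop_toy       <- data.frame(adm3_pcode = "SL040102", total = 33993)

# left join on the pcode columns (all.x = TRUE keeps unmatched case rows, like left_join)
joined <- merge(case_adm3_toy, pop_toy,
                by.x = "admin3pcod", by.y = "adm3_pcode", all.x = TRUE)

# case rate per 10,000 population, as in the mutate() step
joined$case_10kpop <- round(joined$cases / joined$total * 10000, 3)

joined$case_10kpop   # 89.43 - matches the Mountain Rural rate above
```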

Join this table with the ADM3 polygons shapefile for mapping

case_adm3_sf <- case_adm3 %>%                 # begin with cases & rate by admin unit
  left_join(sle_adm3, by="admin3pcod") %>%    # join to shapefile data by common column
  select(objectid, admin3pcod,                # keep only certain columns of interest
         admin3name = admin3name.x,           # clean name of one column
         admin2name, admin1name,
         cases, total, case_10kpop,
         geometry) %>%                        # keep geometry so polygons can be plotted
  st_as_sf()                                  # convert to shapefile

Mapping the results

# tmap mode
tmap_mode("plot")               # view static map

# plot polygons
tm_shape(case_adm3_sf) + 
        tm_polygons("cases") +  # color by number of cases column
        tm_text("admin3name")   # name display

We can also map the incidence rates

# Cases per 10K population
tmap_mode("plot")             # static viewing mode

# plot
tm_shape(case_adm3_sf) +                # plot polygons
  tm_polygons("case_10kpop",            # color by column containing case rate
              breaks=c(0, 10, 50, 100), # define break points for colors
              palette = "Purples"       # use a purple color palette
              ) +
  tm_text("admin3name")                 # display text

Basemaps

OpenStreetMap

Below we describe how to add a basemap using OpenStreetMap features. Alternative methods include using ggmap, which requires free registration with Google (details).

First we load the OpenStreetMap package, from which we will get our basemap.

Then, we create the object map, which we define using the function openmap() from OpenStreetMap package (documentation). We provide the following:

  • upperLeft and lowerRight Two coordinate pairs specifying the limits of the basemap tile
    • In this case we’ve put in the max and min from the linelist rows, so the map will respond dynamically to the data
  • zoom = (if null it is determined automatically)
  • type = which type of basemap - we have listed several possibilities here and the code is currently using the first one ([1]) “osm”
  • mergeTiles = we chose TRUE so the basetiles are all merged into one
# load package
pacman::p_load(OpenStreetMap)

# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(
  upperLeft = c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)),   # limits of basemap tile
  lowerRight = c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
  zoom = NULL,
  type = c("osm", "stamen-toner", "stamen-terrain","stamen-watercolor", "esri","esri-topo")[1])

If we plot this basemap right now, using autoplot.OpenStreetMap() from the OpenStreetMap package, we see that the units on the axes are not latitude/longitude coordinates - it is using a different coordinate system. To correctly display the case residences (which are stored in lat/long), this must be changed.

autoplot.OpenStreetMap(map)

Thus, we want to convert the map to latitude/longitude with the openproj() function from the OpenStreetMap package. We provide the basemap map and also the Coordinate Reference System (CRS) we want. We do this by providing the “proj.4” character string for the WGS 1984 projection, but you can provide the CRS in other ways as well (see this page to better understand what a proj.4 string is).

# Projection WGS84
map_latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

Now when we create the plot we see that the axes show latitude and longitude coordinates. The coordinate system has been converted, so our cases will plot correctly when overlaid!

# Plot map. Must use "autoplot" in order to work with ggplot
autoplot.OpenStreetMap(map_latlon)

See the tutorials here and here for more info.

Contoured density heatmaps

Below we describe how to achieve a contoured density heatmap of cases, over a basemap, beginning with a linelist (one row per case).

  1. Create basemap tile from OpenStreetMap, as described above
  2. Plot the cases from linelist using the latitude and longitude columns
  3. Convert the points to a density heatmap with stat_density_2d() from ggplot2

When we have a basemap with lat/long coordinates, we can plot our cases on top using the lat/long coordinates of their residence.

Building on the function autoplot.OpenStreetMap() to create the basemap, ggplot2 functions will easily add on top, as shown with geom_point() below:

# Plot map. Must be autoplotted to work with ggplot
autoplot.OpenStreetMap(map_latlon)+                 # begin with the basemap
  geom_point(                                       # add xy points from linelist lon and lat columns 
    data = linelist,                                
    aes(x = lon, y = lat),
    size = 1, 
    alpha = 0.5,
    show.legend = FALSE) +                          # drop legend entirely
  labs(x = "Longitude",                             # titles & labels
       y = "Latitude",
       title = "Cumulative cases")

The map above might be difficult to interpret, especially with the points overlapping. So you can instead plot a 2d density map using the ggplot2 function stat_density_2d(). You are still using the linelist lat/lon coordinates, but a 2D kernel density estimation is performed and the results are displayed with contour lines - like a topographical map. Read the full documentation here.

# begin with the basemap
autoplot.OpenStreetMap(map_latlon)+
  
  # add the density plot
  ggplot2::stat_density_2d(
        data = linelist,
        aes(
          x = lon,
          y = lat,
          fill = ..level..,
          alpha = ..level..),
        bins = 10,
        geom = "polygon",
        contour_var = "count",
        show.legend = F) +                          
  
  # specify color scale
  scale_fill_gradient(low = "black", high = "red")+
  
  # labels 
  labs(x = "Longitude",
       y = "Latitude",
       title = "Distribution of cumulative cases")

Time series heatmap

The density heatmap above shows cumulative cases. We can examine the outbreak over time and space by faceting the heatmap based on the month of symptom onset, as derived from the linelist.

We begin with the linelist, creating a new column with the year and month of onset. The format() function from base R changes how a date is displayed. In this case we want “YYYY-MM”.

# Extract month of onset
linelist <- linelist %>% 
  mutate(date_onset_ym = format(date_onset, "%Y-%m"))

# Examine the values 
table(linelist$date_onset_ym, useNA = "always")
## 
## 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12 2015-01 2015-02 2015-03 2015-04    <NA> 
##      12      14      45     104     200     197     152     102      74      43      31      26       0
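The format() call uses strptime-style codes: %Y is the four-digit year and %m is the zero-padded month. A minimal standalone example:

```r
d <- as.Date("2014-05-12")   # example onset date

format(d, "%Y-%m")   # "2014-05" - year and month only, as used above
format(d, "%d %m %Y")   # "12 05 2014" - day, month, year
```

Note that the result of format() is a character string, not a Date, which is why it groups cleanly in table() and facet_wrap().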

Now, we simply introduce faceting via ggplot2 to the density heatmap. facet_wrap() is applied, using the new column. We set the number of facet columns to 4 for clarity.

# packages
pacman::p_load(OpenStreetMap, tidyverse)

# begin with the basemap
autoplot.OpenStreetMap(map_latlon)+
  
  # add the density plot
  ggplot2::stat_density_2d(
        data = linelist,
        aes(
          x = lon,
          y = lat,
          fill = ..level..,
          alpha = ..level..),
        bins = 10,
        geom = "polygon",
        contour_var = "count",
        show.legend = F) +                          
  
  # specify color scale
  scale_fill_gradient(low = "black", high = "red")+
  
  # labels 
  labs(x = "Longitude",
       y = "Latitude",
       title = "Distribution of cumulative cases")+
  
  # facet the plot by month-year of onset
  facet_wrap(~ date_onset_ym, ncol = 4)               

V Data Visualization

ggplot tips

Overview

ggplot2 is the most popular data visualisation package in R, and is generally used instead of base R for creating figures. ggplot2 benefits from a wide variety of supplementary packages that further enhance its functionality. Despite this, ggplot syntax is significantly different from base R plotting, and has a learning curve associated with it. Using ggplot2 generally requires the user to format their data in a way that is highly tidyverse compatible, which ultimately makes using these packages together very effective.

If you want inspiration for ways to creatively visualise your data, we suggest reviewing websites like the R graph gallery and Data-to-viz.

Preparation

Load data

Lets start by reading in the linelist data we’ll use for most of this section:

linelist_cleaned <- rio::import("linelist_cleaned.xlsx")

General cleaning

When preparing data to plot, it is best to make the data adhere to “tidy” data standards as much as possible. How to achieve this is expanded on in the data management pages of this handbook, such as Cleaning data and core functions.

Some simple ways to prepare our data for plotting often involve making the contents better for display - this does not necessarily mean better for data manipulation! For example, we can replace NA values in a character column with the string “Unknown”, or clean variables so that “data-friendly” values with underscores are changed to normal display text. Here are some examples of this in action:

linelist_cleaned <- linelist_cleaned %>%
  # make display version of columns with more friendly names
  mutate(
    # f to Male, f to Female, NA to Unknown
    gender_disp = case_when(gender == "m" ~ "Male",
                            gender == "f" ~ "Female",
                            is.na(gender) ~ "Unknown"),
    # replace NA with unknown for outcome
    outcome_disp = replace_na(outcome, "Unknown")
  )
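For readers coming to the tidyverse from base R, the same recoding logic can be sketched with nested ifelse() calls on a toy vector. case_when() is generally clearer once conditions multiply, but the result here is identical:

```r
gender <- c("m", "f", NA, "m")   # toy input vector

# handle NA first, then recode the remaining values
gender_disp <- ifelse(is.na(gender), "Unknown",
                      ifelse(gender == "m", "Male", "Female"))

gender_disp   # "Male" "Female" "Unknown" "Male"
```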

Pivoting longer

As a matter of data structure, we often also want to pivot our data into longer formats, which allow us to treat a set of columns as a single variable. Read more about this in the page on Pivoting data.

For example, if we wanted to show the number of cases with specific symptoms, we are limited by the fact that each symptom is a specific column. We can pivot this to a longer format like this:

linelist_sym <- linelist_cleaned %>%
  pivot_longer(cols = c("fever", "chills", "cough", "aches", "vomit"),
               names_to = "symptom_name",
               values_to = "symptom_is_present") %>%
  mutate(symptom_is_present = replace_na(symptom_is_present, "unknown"))

Note that this format is not very useful for other operations, and should just be used for the plot it was made for. However, users should endeavor to use these practices as much as possible for the base dataset, as they are more tidyverse compliant, and will make working with the data easier.

Basics of ggplot

Plotting with ggplot2 is based on defining base attributes to a plot, and adding layers on top. In addition, the user can change various plot attributes like axis settings, colour schemes, and labels with additional objects that are “added” to the plot. While ggplot objects can be highly complex, the basic order of creating a ggplot looks something like this:

  1. Define base/default plot attributes and aesthetics with the ggplot() function
  2. Add geometric objects to the plot - i.e. is the plot a bar graph, a line plot, a scatter plot, or a histogram? Or is it a combination of these? These functions all start with geom_ as a prefix.
  3. Change plot aesthetics e.g. changing the axes, labels, colour scheme, background etc.

In code, this might look like this:

# define base plot attributes and dataset
ggplot(data = linelist_cleaned, mapping = aes(x = age)) +
  # add a geometric object with some parameters
  geom_histogram(binwidth = 10, fill = "red", col = "black") +
  # add labels to the axes
  labs(x = "Age in years", y = "Number of cases")
## Warning: Removed 87 rows containing non-finite values (stat_bin).

With this code, the most important things to note are:

  1. When making a ggplot, all objects are combined with a + sign.
  2. Understanding the principles behind aesthetic mapping with the mapping = aes() argument is essential to using ggplot. This can be done in the ggplot() function as well as in every geometric object. Mapping with aes() is used to define which variables are assigned to each axis (these can be continuous or categorical variables). It is also used to map a variable to other plot aesthetics. This can apply to the:
a. line colour (`col = `)
b. filled colour (`fill = `)
c. linetype (e.g. dotted, dashed) (`linetype =`)
d. size of an object (`size = `)

This list is not exhaustive, but is enough to give a rough overview.

  3. Aesthetics of geometric objects can also be set to fixed values, as in the code above - this is different from mapping them to a variable. When this is done, it must be outside the mapping = aes() argument.
# correct
ggplot(data = linelist_cleaned, mapping = aes(x = age)) +
  geom_histogram(col = "black")

# incorrect
ggplot(data = linelist_cleaned, mapping = aes(x = age)) +
  geom_histogram(mapping = aes(col = "black"))

An example of defining aesthetics with a variable can be seen here:


# define base plot attributes and dataset
ggplot(data = linelist_cleaned, mapping = aes(x = age, fill = outcome)) +
  # add a geometric object with some parameters (NO FILL GIVEN)
  geom_histogram(binwidth = 10, col = "black") +
  # add labels to the axes
  labs(x = "Age in years", y = "Number of cases")
## Warning: Removed 87 rows containing non-finite values (stat_bin).

There are a huge number of different geoms that can be used, and they are all used with similar attribute names. While not exhaustive, some of the shapes that can be used are:

  1. Histograms - geom_histogram()
  2. Barcharts - geom_bar()
  3. Boxplots - geom_boxplot()
  4. Dot plots (for scatterplots or with discrete variables) - geom_point()
  5. Line graphs - geom_line() or geom_path()
  6. Trend lines - geom_smooth()

You can also add straight lines to your plot with geom_hline() (horizontal), geom_vline() (vertical) or geom_abline() (with a specified y intercept and slope)
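As a sketch (reusing the ht_cm and wt_kg columns shown in the example below), reference lines are simply added as extra layers:

```r
# add reference lines to a height/weight scatterplot
ggplot(data = linelist_cleaned, mapping = aes(x = ht_cm, y = wt_kg)) +
  geom_point() +
  geom_vline(xintercept = 100, linetype = "dashed") + # vertical line at 100 cm
  geom_hline(yintercept = 50, col = "red")            # horizontal line at 50 kg
```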

There is much more detail we could show here, but we’ll finish with an example that ties these concepts together by plotting the relationship between the height and weight of all the patients. We can also colour the points by age in years.

# set up the plot and define key variables
# colour the points by age in years
wt_ht_plot <- ggplot(data = linelist_cleaned,
                     aes(y = wt_kg, x = ht_cm, col = age_years)) +
  # define aspects of the geom that are NOT included specific to variables
  # other attributes are inherited
  geom_point(size = 1, alpha = 0.5) +
  # add a trend line
  # use a linear method
  geom_smooth(method = "lm")
wt_ht_plot
## `geom_smooth()` using formula 'y ~ x'

Themes and Labels

One of the most important aspects of data visualisation is presenting data clearly with nice aesthetics. The plot we made previously looks OK, but we could make the theme a little nicer. ggplot2 comes with some preset themes that can be used to change the overall look of a plot. We can also edit the theme of the plot in extreme detail with the theme() function, and add nicer labels with the labs() function. There are 5 standard labeling locations:

  1. x - the x-axis
  2. y - the y-axis
  3. title - the main plot title
  4. subtitle - directly underneath the plot title in smaller text (by default)
  5. caption - bottom of plot, on the right by default

For example, we can update the plot we previously plotted with nice labels like this:

wt_ht_plot <- wt_ht_plot + 
  # set the theme to classic
  theme_classic() +
  # add nicer labels
  labs(y = "Weight (kg)", 
       x = "height (cm)",
       title = "Patient height and weight",
       subtitle = glue::glue("total patients {nrow(linelist_cleaned)}"),
       caption = "produced by me!")
wt_ht_plot
## `geom_smooth()` using formula 'y ~ x'

The theme() function can also be used to edit the defaults of these elements. This function can take an extremely large number of arguments, each of which edits a very specific aspect of the plot. We won’t go through all of them; instead, let’s look at how text elements are edited. The basic process is:

  1. Calling the specific argument of theme() for the element we want to edit (e.g. plot.title for the plot title)
  2. Supplying the element_text() function to the argument (there are other versions of this e.g. element_rect() for editing the plot background aesthetics)
  3. Changing the arguments in element_text()

For example, we increase the size of the plot title with size, make the subtitle italicised with face, and left-justify the caption with hjust (captions are right-aligned by default). We’ll also change the legend location for good measure!

wt_ht_plot + 
    theme(legend.position = "bottom",
          # size of title is 30
          plot.title = element_text(size = 30),
          # left-justify caption (hjust = 0; default is right-aligned)
          plot.caption = element_text(hjust = 0),
          # subtitle is italicised
          plot.subtitle = element_text(face = "italic"))
## `geom_smooth()` using formula 'y ~ x'

If you ever want to remove an element of a plot, you can also do it through theme()! Just pass element_blank() to an argument in theme to have it disappear completely!
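For example, a quick sketch that strips elements from the plot made above:

```r
# remove the legend title and the axis tick marks entirely
wt_ht_plot +
  theme(legend.title = element_blank(), # no legend title
        axis.ticks = element_blank())   # no tick marks on either axis
```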

Colour schemes

One thing that can initially be difficult to understand with ggplot2 is control of colour schemes when passing colour or fill as a variable rather than defining them explicitly within a geom. There are a few simple tricks that can be used to achieve this however. Remember that when setting colours, you can use colour names (as long as they are recognised) like "red", or a specific hex colour such as "#ff0505".

One of the most useful tricks is using manual scaling functions to explicitly define colours. These are functions with the syntax scale_xxx_manual() (e.g. scale_colour_manual()). In this function you can explicitly define which colours map to which factor level using the values argument. You can control the legend title with the name argument, and the order of factors with breaks.

If you want predefined palettes, you can use the scale_xxx_brewer() or scale_xxx_viridis_*() functions. The brewer functions draw from colorbrewer.org palettes, and the viridis functions draw from the viridis (colourblind friendly!) palettes. Remember to specify whether the palette is discrete, continuous, or binned via the function suffix (e.g. scale_fill_viridis_d() for discrete).

We can see this by using the symptom-specific dataframe we made in the previous section:

symp_plot <- ggplot(linelist_sym, aes(x = symptom_name, fill = symptom_is_present)) +
  # show as a portion of all
  geom_bar(position = "fill", col = "black") +
  theme_classic() +
  labs(
    x = "Symptom",
    y = "Symptom status (proportion)"
  )

symp_plot

symp_plot +
  scale_fill_manual(
    # explicitly define colours
    values = c("yes" = "black",
               "no" = "white",
               "unknown" = "grey"),
    # order the factors correctly
    breaks = c("yes", "no", "unknown"),
    # legend has no title
    name = ""
  ) 

symp_plot +
  scale_fill_viridis_d(
    breaks = c("yes", "no", "unknown"),
    name = ""
  )

Change order of discrete variables

Changing the order in which discrete variables appear is often difficult to understand for people who are new to ggplot2 graphs. It’s easy once you understand how ggplot2 handles discrete variables under the hood. Generally speaking, a discrete variable is automatically converted to a factor type, whose levels are ordered alphabetically by default. To change this, simply reorder the factor levels to reflect the order you would like them to appear in the chart. For more detailed information on how to reorder factor objects, see the factor section of the guide.

We can look at a common example using age groups - by default the 5-9 age group will be placed in the middle of the age groups (given alphabetical order), but we can move it to directly after the 0-4 age group by releveling the factor.

# remove the instances of age_cat5 where data is missing
ggplot(linelist_cleaned %>%
         filter(!is.na(age_cat5)),
       # relevel the factor within the ggplot call (can do externally as well)
       aes(x = forcats::fct_relevel(age_cat5, "5-9", after = 1))) +
  geom_bar() + # age_cat5 is discrete, so use geom_bar() rather than geom_histogram()
  labs(x = "Age group", y = "Number of hospitalisations",
       title = "Total hospitalisations by age group") +
  theme_minimal()

Multiple plots

It’s often useful to show multiple graphs on one page, or in one super-figure. There are a few ways to achieve this and a lot of packages that can help facilitate it. However, while external packages are nice, it is often easier to use faceting, an alternative that is built into ggplot2. Faceting is extremely easy to do in terms of code, and produces plots with more predictable aesthetics - you won’t have to wrangle legends or ensure that axes are aligned.

Faceting is a very specific way to obtain multiple plots - by definition, to facet you have to show the same type of plot in each facet, where every plot is specific to a level of a variable. This is done with one of two functions:

  1. facet_wrap() This is used when you want to show a different graph for each level of a single variable. One example of this could be showing a different epidemic curve for each hospital in a region.

  2. facet_grid() This is used when you want to bring a second variable into the faceting arrangement. Here each panel of the grid shows the intersection of a row variable and a column variable. For example, this could involve showing a different epidemic curve for each hospital in a region, arranged horizontally, and each age group, arranged vertically.

This can quickly become an overwhelming amount of information - it’s good to ensure you don’t have too many levels of each variable that you choose to facet by! Here are some quick examples with the malaria dataset:

malaria_data <- rio::import(here::here("data", "facility_count_data.rds")) 

# show a wrapped plot with facets by district

ggplot(malaria_data, aes(x = data_date, y = malaria_tot, fill = District)) +
  geom_bar(stat = "identity") +
  labs(
    x = "date of data collection",
    y = "malaria cases",
    title = "Malaria cases by district"
  ) +
  facet_wrap(~District) +
  theme_minimal()

We can also use a facet_grid() approach with the different age groups - we need to do some data transformation first however, as the age groups are each in their own column and we want them in a single column. When you pass the two variables to facet_grid(), you can use formula notation (e.g. x ~ y) or wrap the variables in vars(). For reference, facet_grid(x ~ y) is equivalent to facet_grid(rows = vars(x), cols = vars(y)). Here’s how we can do this:

malaria_age <- malaria_data %>%
  pivot_longer(
    # choose all the columns that start with malaria rdt (age group specific)
    cols = starts_with("malaria_rdt_"),
    # column names become age group
    names_to = "age_group",
    # values to a single column (num_cases)
    values_to = "num_cases"
  ) %>%
  # clean up age group column - replace "malaria_rdt_" to leave only age group
  # then replace 15 with 15+
  # then refactor the age groups so they are in order
  mutate(age_group = str_replace(age_group, "malaria_rdt_", "") %>%
           ifelse(. == "15", "15+", .) %>%
           forcats::fct_relevel(., "5-14", after = 1))


# make the same plot as before, but show in a grid
ggplot(malaria_age, aes(x = data_date, y = num_cases, fill = age_group)) +
  geom_bar(stat = "identity") +
  labs(
    x = "date of data collection",
    y = "malaria cases",
    title = "Malaria cases by district and age group"
  ) +
  facet_grid(rows = vars(District), cols = vars(age_group)) +
  theme_minimal()

While faceting is a convenient approach to plotting, sometimes it’s not possible to get the results you want from its relatively restrictive approach. In that case, you may choose to combine separate plots into one larger figure. Three well known packages are great for this - cowplot, gridExtra, and patchwork. They largely do the same things, so we’ll focus on cowplot for this section.

The cowplot package has a fairly wide range of functions, but the easiest use of it can be achieved through the use of plot_grid(). This is effectively a way to arrange predefined plots in a grid formation. We can work through another example with the malaria dataset - here we can plot the total cases by district, and also show the epidemic curve over time.

# bar chart of total cases by district
p1 <- ggplot(malaria_data, aes(x = District, y = malaria_tot)) +
  geom_bar(stat = "identity") +
  labs(
    x = "District",
    y = "Total number of cases",
    title = "Total malaria cases by district"
  ) +
  theme_minimal()

# epidemic curve over time
p2 <- ggplot(malaria_data, aes(x = data_date, y = malaria_tot)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Date of data submission",
    y =  "number of cases"
  ) +
  theme_minimal()

cowplot::plot_grid(p1, p2,
                  # 1 column and two rows - stacked on top of each other
                   ncol = 1,
                   nrow = 2,
                   # top plot is 2/3 as tall as second
                   rel_heights = c(2, 3))

Smart Labeling

In ggplot2, it is also possible to add text to plots. However, this comes with a notable limitation: text labels often clash with data points, making the plot look messy or hard to read. There is no ideal way to deal with this in the base package, but a ggplot2 add-on known as ggrepel makes dealing with this very simple!

The ggrepel package provides two new functions, geom_label_repel() and geom_text_repel(), which replace geom_label() and geom_text(). Simply use these functions instead of the base ones to produce neat labels. You can also use the force argument to change the degree of repulsion between labels and their respective points.

For our example, we will make a scatterplot showing height against weight again. We’re also going to label each point with a patient id when the patient is over 70 years of age. We’ll use a trick with filter to only show these specific points!

pacman::p_load(ggrepel)

ggplot(linelist_cleaned, 
       aes(x = ht_cm,
           y = wt_kg)) +
  geom_point() + 
  # pass the filtered version of the dataset as a new dataset
  ggrepel::geom_label_repel(data = linelist_cleaned %>% filter(age_years > 70),
                           aes(label = case_id),
                           force = 1) +
  labs(y = "weight (kg)", x = "height (cm)")

Time axes

Working with time axes in ggplot can seem daunting, but is made very easy with a few key functions. Remember that when working with time or date that you should ensure that the correct variables are formatted as date or datetime class - see the working with dates section for more information on this.

The single most useful set of functions for working with dates in ggplot2 are the scale functions (scale_x_date(), scale_x_datetime(), and their cognate y-axis functions). These functions let you define how often you have axis labels, and how to format axis labels. To find out how to format dates, see the working with dates section again! You can use the date_breaks and date_labels arguments to specify how dates should look:

  1. date_breaks allows you to specify how often axis breaks occur - you can pass a string here (e.g. "3 months", or "2 days")

  2. date_labels allows you to define the format dates are shown in. You can pass a date format string to these arguments (e.g. "%b-%d-%Y"):

# make epi curve by date of onset when available
ggplot(linelist_cleaned, aes(x = date_onset)) +
  geom_bar(stat = "count") +
  scale_x_date(
    # 1 break every 1 month
    date_breaks = "1 month",
    # labels should show month then date
    date_labels = "%b %d"
  ) +
  theme_classic()

Highlighting

Highlighting specific elements in a chart is a useful way to draw attention to a specific instance of a variable while also providing information on the dispersion of the full dataset. While this is not easily done in base ggplot2, there is an external package known as gghighlight that can help to do this. It is easy to use within the ggplot syntax.

The gghighlight package uses the gghighlight() function to achieve this effect. To use this function, supply a logical statement to the function - this can have quite flexible outcomes, but here we’ll show an example of the age distribution of cases in our linelist, highlighting them by outcome.

# load gghighlight
library(gghighlight)


# replace NA values with unknown in the outcome variable
linelist_cleaned <- linelist_cleaned %>%
  mutate(outcome = replace_na(outcome, "Unknown"))

# produce a histogram of all cases by age
ggplot(linelist_cleaned, 
       aes(x = age_years, fill = outcome)) +
  geom_histogram() + 
  # highlight instances where the patient has died.
  gghighlight::gghighlight(outcome == "Death")

This also works well with faceting functions - it produces facet plots in which the full dataset is shown in the background of each facet, with the data that doesn’t belong to that facet de-emphasised!

# produce a histogram of all cases by age
ggplot(linelist_cleaned, 
       aes(x = age_years, fill = outcome)) +
  geom_histogram() + 
  # highlight the data belonging to each facet
  gghighlight::gghighlight() +
  facet_wrap(~outcome)

Dual axes

A secondary y-axis is often a requested addition to a ggplot2 graph. While there is a robust debate about the validity of such graphs in the data visualization community, and they are often not recommended, your manager may still want them. Below, we present two methods to achieve them.

  1. Using the cowplot package to combine two separate plots
  2. Using a statistical transformation of the data on the primary axis

Using cowplot

This approach involves creating two separate plots - one with a y-axis on the left, and the other with a y-axis on the right. Both will use theme_cowplot() and must have the same x-axis. Then, in a third command, the two plots are aligned and overlaid on top of each other. The functionalities of cowplot, of which this is only one, are described in depth at this site.

To demonstrate this technique we will overlay the epidemic curve with a line of the weekly percent of patients who died. We use this example because the alignment of dates on the x-axis is more complex than say, aligning a bar chart with another plot. Some things to note:

  • The epicurve and the line are aggregated into weeks prior to plotting and the date_breaks and date_labels are identical - we do this so that the x-axes of the two plots are the same when they are overlaid.
  • The y-axis is moved to the right-side for plot 2 with the position = argument of scale_y_continuous().
  • Both plots make use of theme_cowplot()

Note there is another example of this technique in the [Epicurves] page - overlaying cumulative incidence on top of the epicurve.

Make plot 1
This is essentially the epicurve. We use geom_area() just to demonstrate its use (it shades the area under a line, by default)

pacman::p_load(cowplot)            # load/install cowplot

p1 <- linelist %>%                 # save plot as object
     count(
       epiweek = lubridate::floor_date(date_onset, "week")) %>% 
     ggplot()+
          geom_area(aes(x = epiweek, y = n), fill = "grey")+
          scale_x_date(
               date_breaks = "month",
               date_labels = "%b")+
     theme_cowplot()+
     labs(
       y = "Weekly cases"
     )

p1                                      # view plot 

Make plot 2
Create the second plot showing a line of the weekly percent of cases who died.

p2 <- linelist %>%         # save plot as object
     group_by(
       epiweek = lubridate::floor_date(date_onset, "week")) %>% 
     summarise(
       n = n(),
       pct_death = 100*sum(outcome == "Death", na.rm=T) / n) %>% 
     ggplot(aes(x = epiweek, y = pct_death))+
          geom_line()+
          scale_x_date(
               date_breaks = "month",
               date_labels = "%b")+
          scale_y_continuous(
               position = "right")+
          theme_cowplot()+
          labs(
            x = "Epiweek of symptom onset",
            y = "Weekly percent of deaths",
            title = "Weekly case incidence and percent deaths"
          )

p2     # view plot

Now we align the plot using the function align_plots(), specifying horizontal and vertical alignment (“hv”, could also be “h”, “v”, “none”). We specify alignment of all axes as well (top, bottom, left, and right) with “tblr”. The output is of class list (2 elements).

Then we draw the two plots together using ggdraw() (from cowplot) and referencing the two parts of the aligned_plots object.

aligned_plots <- align_plots(p1, p2, align="hv", axis="tblr")                  # align the two plots and save them as list
aligned_plotted <- ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])  # overlay them and save the visual plot
aligned_plotted                                                                # print the overlayed plots

Statistical transformation

Unfortunately, secondary axes are not well supported in the ggplot syntax. For this reason, you’re fairly limited in terms of what can be shown with a secondary axis - the secondary axis has to be a direct transformation of the primary axis.

Differences in axis values will be purely cosmetic - if you want to show two different variables on one graph, with different y-axis scales for each variable, this will not work without some work behind the scenes. To obtain this effect, you will have to transform one of your variables in the data, and apply the same transformation in reverse when specifying the axis labels. Based on this, you can either specify the transformation explicitly (e.g. variable a is around 10x as large as variable b) or calculate it in the code (e.g. what is the ratio between the maximum values of each dataset).

The syntax for adding a secondary axis is very straightforward! When calling a scale_xxx_xxx() function (e.g. scale_y_continuous()), use the sec.axis argument to call the sec_axis() function. The trans argument of sec_axis() specifies the transformation for the secondary axis labels - provide it as a formula (e.g. ~ . / tf_ratio).

For example, if we want to show the number of positive RDTs in the malaria dataset for facility 1, showing 0-4 year olds and all cases on one chart:

# take malaria data from facility 1
malaria_facility_1 <- malaria_data %>%
  filter(location_name == "Facility 1")

# calculate the ratio between malaria_rdt_0-4 and malaria_tot 

tf_ratio <- max(malaria_facility_1$malaria_tot, na.rm = T) / max(malaria_facility_1$`malaria_rdt_0-4`, na.rm = T)

# transform the values in the dataset

malaria_facility_1 <- malaria_facility_1 %>%
  mutate(malaria_rdt_0_4_tf = `malaria_rdt_0-4` * tf_ratio)
  

# plot the graph with dual axes

ggplot(malaria_facility_1, aes(x = data_date)) +
  geom_line(aes(y = malaria_tot, col = "Total cases")) +
  geom_line(aes(y = malaria_rdt_0_4_tf, col = "Cases: 0-4 years old")) +
  scale_y_continuous(
    name = "Total cases",
    sec.axis = sec_axis(trans = ~ . / tf_ratio, name = "Cases: 0-4 years old")
  ) +
  labs(x = "date of data collection") +
  theme_minimal() +
  theme(legend.title = element_blank())

Resources

Inspiration ggplot graph gallery

Facets and labellers Using labellers for facet strips Labellers

Adjusting order with factors fct_reorder
fct_inorder
How to reorder a boxplot
Reorder a variable in ggplot2
R for Data Science - Factors

Legends
Adjust legend order

Captions Caption alignment

Cheatsheets
Beautiful plotting with ggplot2

TO DO - Under construction

Using option label_wrap_gen in facet_wrap to have multiple strip lines labels and colors of strips

Axis text vertical adjustment rotation Labellers

limit range with limit() and coord_cartesian(), ylim(), or scale_x_continuous() theme_classic()

expand = c(0,0) coord_flip() tick marks

ggrepel animations

remove remove title using fill = or color = in labs() flip order / don’t flip order move location color? theme(legend.title = element_text(colour=“chocolate”, size=16, face=“bold”))+ scale_color_discrete(name=“This color ischocolate!?”) Color of boxes behind points in legend theme(legend.key=element_rect(fill=‘pink’)) or use fill = NA to remove them. http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/ Change size of symbols in legend only guides(colour = guide_legend(override.aes = list(size=4)))

Turn off a layer in the legend geom_text(data=nmmaps, aes(date, temp, label=round(temp)), size=4) geom_text(data=nmmaps, aes(date, temp, label=round(temp), size=4), show_guide=FALSE)

Force a legend even if there is no aes(). ggplot(nmmaps, aes(x=date, y=o3))+ geom_line(aes(color=“Important line”))+ geom_point(aes(color=“My points”)) Control the shape in the legend with guides - a list with linetype and shape ggplot(nmmaps, aes(x=date, y=o3))+geom_line(aes(color=“Important line”))+ geom_point(aes(color=“Point values”))+ scale_colour_manual(name=’‘, values=c(’Important line’=‘grey’, ‘Point values’=‘red’), guide=‘legend’) + guides(colour = guide_legend(override.aes = list(linetype=c(1,0) , shape=c(NA, 16))))

Epidemic curves

An epidemic curve (also known as an “epi curve”) is a core epidemiological chart typically used to visualize the temporal pattern of illness onset among a cluster or epidemic of cases.

Analysis of the epicurve can reveal temporal trends, outliers, the magnitude of the outbreak, the most likely time period of exposure, time intervals between case generations, and can even help identify the mode of transmission of an unidentified disease (e.g. point source, continuous common source, person-to-person propagation). One online lesson on interpretation of epi curves can be found at the website of the US CDC.

In this page we demonstrate two approaches to producing epicurves in R:

  • The incidence package, which can produce an epi curve with fast and simple commands
  • The ggplot2 package, which allows for advanced customizability via more complex commands

The combination of these packages is also addressed, as are specific use-cases such as:

  • Plotting aggregated count data
  • Faceting or producing small-multiples
  • Applying moving averages
  • Showing which data are “tentative” or subject to reporting delays
  • Overlaying cumulative case incidence using a second axis

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(
  rio,          # file import/export
  here,         # relative filepaths 
  lubridate,    # working with dates/epiweeks
  aweek,        # alternative package for working with dates/epiweeks
  incidence,    # epicurves of linelist data
  stringr,      # search and manipulate character strings
  forcats,      # working with factors
  RColorBrewer, # Color palettes from colorbrewer2.org
  tidyverse     # data management + ggplot2 graphics
) 

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated epidemic
  • Aggregated counts by hospital from the same simulated epidemic

The datasets are imported using the import() function from the rio package. See the page on Import and export for various ways to import data. The linelist and aggregated counts are displayed below.

linelist <- rio::import("linelist_cleaned.xlsx")

Case linelist

The first 50 rows are displayed

Case counts aggregated by hospital

The first 50 rows are displayed

Set parameters

You may want to set editable parameters for production of a report, such as the date for which the data is current (the “data date”). You can then reference data_date in the code when applying filters or in captions that auto-update.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")
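As a sketch (assuming a date_onset column, as used elsewhere in this handbook), data_date can then be referenced in filters and captions:

```r
# keep only cases with onset on or before the data date
linelist_recent <- linelist %>%
  dplyr::filter(date_onset <= data_date)

# reference the data date in a plot caption that auto-updates
ggplot(linelist_recent, aes(x = date_onset)) +
  geom_bar() +
  labs(caption = stringr::str_glue("Data as of {data_date}"))
```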

Verify dates

Verify that each relevant date column is class Date and has an appropriate range of values.

You can do this one-by-one using hist() for histograms, or range() with na.rm=TRUE.
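For instance (assuming date_onset is one such column):

```r
# confirm the class and value range of a single date column
class(linelist$date_onset)               # should be "Date"
range(linelist$date_onset, na.rm = TRUE) # earliest and latest onset dates
hist(linelist$date_onset, breaks = 50)   # histogram of onset dates
```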

An alternative is to use a for loop to print a histogram for each pre-defined date column.

# create character vector of column names 
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))

# Produce histogram of each date column
for (Col in DateCols) {     # open loop. iterate for each name in vector DateCols
  hist(linelist[, Col],     # print histogram of the column in linelist dataframe
       breaks = 50,         # number of breaks for the histogram
       xlab = Col)          # x-axis label is the name of the column
  }                         # close the loop

Epicurves with incidence package

Below we demonstrate how to make epicurves using the incidence package.

CAUTION: The incidence package currently expects data to be in a “linelist” format of one row per case (not aggregated counts). If your data is aggregated counts, read the section on aggregated data using ggplot2.

The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console. For further information see this incidence package vignette.

Simple example

Two steps are required to plot an epicurve with the incidence package:

  1. Create an incidence object (using the function incidence())
    • Provide the case linelist
    • Specify the time interval into which the cases should be aggregated (daily, weekly, monthly..)
    • Specify any sub-groups
  2. Plot the incidence object
    • Specify labels, aesthetic themes, etc.

A simple example - an epicurve of daily cases:

# load incidence package
pacman::p_load(incidence)

# create the incidence object, aggregating cases by day
epi_day <- incidence(linelist$date_onset,  # the dataset and date column of interest
                     interval = "day")     # the time interval

# plot the incidence object
plot(epi_day)

Change time interval of case aggregation

The interval argument of incidence() defines how the observations are grouped into vertical bars. Some options are given below:

Argument option     Further explanation
“week”              note: Monday start day is default
“2 weeks”           or 3, 4, 5…
“Sunday week”       weeks beginning on Sundays
“2 Sunday weeks”    or 3, 4, 5…
“MMWRweek”          week starts on Sundays - see US CDC
“month”             1st of month
“quarter”           1st month of the quarter
“2 months”          or 3, 4, 5…
“year”              1st day of calendar year

Below are examples of how different intervals look when applied to the linelist. The format and frequency of the date labels on the x-axis are the defaults for the specified interval.

# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk      <- incidence(linelist$date_onset, interval = "Monday week")

# Sunday week
epi_Sun_wk  <- incidence(linelist$date_onset, interval = "Sunday week")

# Three weeks (Monday weeks by default)
epi_3wk     <- incidence(linelist$date_onset, interval = "3 weeks")

# Monthly
epi_month   <- incidence(linelist$date_onset, interval = "month")


# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+     labs(title = "Monday weeks")
plot(epi_Sun_wk)+ labs(title = "Sunday weeks")
plot(epi_3wk)+    labs(title = "3 (Monday) weeks")
plot(epi_month)+  labs(title = "Months")

Filtered data

To plot the epicurve of a subset of data:

  1. Filter the linelist data
  2. Provide the filtered data to the incidence() command
  3. Plot the incidence object

The example below uses data filtered to show only cases at Central Hospital.

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot the incidence object
plot(central_outbreak) + labs(title = "Weekly case incidence at Central Hospital")

Modifications with plot()

An epicurve produced by incidence can be modified via these arguments within the plot() function.

  • show_cases = Logical; if TRUE, each case is shown as an individual box. This displays best for smaller outbreaks.
  • color = Color of case bars/boxes
  • border = Color of line around boxes, if show_cases = TRUE
  • alpha = Transparency of case bars/boxes (1 is fully opaque, 0 is fully transparent)
  • xlab = Title of x-axis
  • ylab = Title of y-axis - defaults to user-defined incidence time interval
  • labels_week = Logical; whether x-axis labels are in week format (YYYY-Www) or date format (YYYY-MM-DD), absent other modifications
  • n_breaks = Number of x-axis label breaks, absent other modifications
  • first_date = & last_date = Dates used to limit the date axis of the plot

Type ?plot.incidence in the R console for more details on each. Below is an example using some of the above arguments.

To further adjust plot appearance, see the section on using ggplot() to apply theme() arguments to the incidence plot.

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot incidence object
plot(central_outbreak,
     xlab = "Week of onset",
     ylab = "Weekly case incidence",
     show_cases = TRUE,       # show each case as an individual box
     alpha = 0.5,
     color = "darkblue",
     border = "white")

Modifications with ggplot2

You can add ggplot2 modifications to the incidence plot by adding a + after the close of the incidence plot() function, as demonstrated below. See the ggplot2 section and page on ggplot tips for more options.

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot
plot(central_outbreak,         # plot with incidence package and arguments
     xlab = "Week of onset",
     ylab = "Weekly case incidence",
     show_cases = TRUE,
     alpha = 0.5,
     color = "darkblue",
     border = "black")+
  
  # Add modifications using ggplot() functions
  ############################################
  scale_x_date(            # convert to ggplot date scale (changes default label format)
    expand = c(0,0))+      # remove excess space on left and right
  
  scale_y_continuous(
    expand = c(0,0))+      # remove excess space below 0 on y-axis
  
  labs(
    title = "Incidence plot with ggplot() modifications",
    caption = stringr::str_glue(                            # dynamic caption - see page on characters and strings
      "n = {central_cases} from Central Hospital
      Case onsets range from {earliest_date} to {latest_date}. {missing_onset} cases are missing date of onset and not shown",
      central_cases = nrow(central_data),
      earliest_date = format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y'),
      latest_date = format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y'),      
      missing_onset = nrow(central_data %>% filter(is.na(date_onset)))))+
  
  theme_classic()+         # simplify background
  
  theme(
    axis.title = element_text(size = 12, face = "bold"), # axis titles larger and bold
    axis.text = element_text(size = 10, face = "bold"),  # axis text size and bold
    plot.caption = element_text(hjust = 0)               # move caption to left
  )

Group and color by values

To color cases by a value, provide the column to the groups = argument in the incidence() command.

In the example below, the cases in the whole outbreak are colored by their age category. Note the incidence() argument na_as_group =: if TRUE (the default), missing values (NA) form their own group. To adjust the legend title, add the ggplot2 function labs() as shown in the second plot, specifying a label for fill =.

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # Monday weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group


# plot the grouped incidence object
plot(age_outbreak)

# plot the grouped incidence object, specifying legend title
plot(age_outbreak)+
  labs(fill = "Age category")

Change colors

To specify colors manually, provide the name of a color or a character vector of colors to the argument color =. The number of colors listed must equal the number of groups (be aware that missing values may form their own group).

# weekly outbreak by hospital
hosp_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$hospital,
                               na_as_group = FALSE)   # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)

# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))

Change color palette

Use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).

Alternatively, adjust the palette with ggplot2 “fill” scales - see the ggplot tips page for details.

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group

# plot the epicurve with default palette
plot(age_outbreak)

# plot with different color palette
plot(age_outbreak, col_pal = rainbow)

Adjust level order

To adjust the order of group appearance (on plot and in legend), the grouping column must be class Factor. See the page on Factors for more information.

Below is an epicurve by hospital. The objective is to show the “Missing” and “Other” groups at the top of the bars, but to have the legend in the reverse order, so that “Missing” appears at the bottom of the legend.

First, let’s see the plot with the default ordering:

# ORIGINAL - hospital NOT as factor
###################################

# create weekly incidence object, rows grouped by hospital and week
hospital_outbreak <- incidence(
  linelist$date_onset, 
  interval = "week", 
  groups = linelist$hospital)

# plot incidence object
plot(hospital_outbreak,
     show_cases = FALSE)+
  labs(title = "ORIGINAL - hospital not a factor")

Now, to make some changes to the levels we can do the following:

  • The package forcats is loaded, to work with factors
  • A dataset for plotting is defined in which:
    • the hospital column is re-defined as a factor with as_factor()
    • missing values (NA) are converted to “Missing” with fct_explicit_na()
    • low-count hospitals are combined into “Other” with fct_lump(), keeping the 3 most frequent
    • the order of levels is defined with “Missing” and “Other” first, so they appear at the top of the bars
  • The incidence object is created and plotted as before
    • Colors are specified so that “Missing” is grey, and the background is simplified with theme_classic()
    • The order of the legend is reversed using guides() from ggplot2

# MODIFIED - hospital as factor
###############################

# load forcats package for working with factors
pacman::p_load(forcats)

# Convert hospital column to factor and adjust levels
plot_data <- linelist %>% 
  mutate(hospital = as_factor(hospital)) %>%                      # define as factor
  mutate(hospital = fct_explicit_na(hospital, "Missing")) %>%     # convert NA to "Missing" 
  mutate(hospital = fct_lump(hospital, n = 3)) %>%                # Keep 3 most frequent hospitals, with remaining combined into "Other" 
  mutate(hospital = fct_relevel(hospital, c("Missing", "Other"))) # Set "Missing" and "Other" as top levels


# Create weekly incidence object, grouped by hospital and week
hospital_outbreak_mod <- incidence(
  plot_data$date_onset, 
  interval = "week", 
  groups = plot_data$hospital)

# plot incidence object
plot(hospital_outbreak_mod,
     show_cases = FALSE,    # do NOT show box around each case
     color = c("grey", "beige", "darkgreen", "brown"))+   # specify colors                      
  
  # ggplot modifications     
  guides(fill = guide_legend(reverse = TRUE))+  # reverse order of legend only
  
  theme_classic()+

  # labels added via ggplot
  labs(
      title = "MODIFIED - hospital as factor",   # plot title
      subtitle = "Other & Missing at top of epicurve and bottom of legend and fewer categories",
      y = "Weekly case incidence",               # y axis title  
      x = "Week of symptom onset",               # x axis title
      fill = "Hospital")                         # title of legend     

Change legend

Add ggplot2 commands to the incidence plot, such as:

  • labs(fill = "Legend title") to change the legend title
  • theme(legend.title = element_blank()) to remove the legend title
  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • guides(fill = guide_legend(reverse = TRUE)) to reverse order of the legend

See the page on ggplot tips for more details on working with legends.
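As a self-contained sketch of these commands in action (toy data standing in for an incidence plot; the same + additions apply to a plot() of an incidence object):

```r
# load ggplot2 (via pacman, as elsewhere in this handbook)
pacman::p_load(ggplot2)

# toy stacked-bar data, for demonstration only
toy <- data.frame(
  week  = rep(1:4, each = 2),
  cases = c(3, 1, 5, 2, 8, 4, 6, 3),
  group = rep(c("A", "B"), times = 4))

p <- ggplot(toy, aes(x = week, y = cases, fill = group)) +
  geom_col() +
  labs(fill = "Legend title") +                  # change the legend title
  theme(legend.position  = "bottom",             # move legend below the plot
        legend.direction = "horizontal") +       # lay the keys out horizontally
  guides(fill = guide_legend(reverse = TRUE))    # reverse the key order

p    # print the plot
```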

Date-axis labels/gridlines

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars

Modify the bars

The aggregation of data into bars occurs when you set interval = while creating the incidence object. Options for interval include “day”, “Monday week”, “Sunday week”, “month”, “2 weeks”, etc., as described in an earlier section.

Modify date-axis labels (frequency & format)

If working with the incidence package, you have several options to make modifications to the date-axis labels:

  1. Add incidence package functions scale_x_incidence() and make_breaks()
  2. Add the ggplot2 function scale_x_date() and arguments such as date_breaks = and date_labels =
  3. Use a combination of the above

Option 1: Add scale_x_incidence() only

scale_x_incidence() is from the incidence package.

  • Advantages: Short code. Auto-adjusts weekly labels to the interval of incidence object (Monday, Sunday weeks, etc.)
  • Disadvantages: Cannot make fine adjustments to label format, nor to minor vertical grid-lines between labels
  • Syntax: Provide the name of the incidence object to ensure date labels align with specified interval (e.g. Sunday or Monday weeks)

Optional arguments:

  • Use n_breaks = to specify the number of date labels, which start from the beginning of the interval of the first case
    • Tip: for breaks every nth interval, use n_breaks = nrow(i)/n (where “i” is the incidence object name and “n” is a number)
  • Use labels_week = to adjust whether labels are formatted as weeks (YYYY-Www) or dates (YYYY-MM-DD)
    • One vertical gridline will appear per date label

Other notes:

  • If the interval is “month”, n_breaks and labels_week will behave differently
  • Adding ggplot2’s scale_x_date() to the plot will remove any labels created by scale_x_incidence()
  • Type ?scale_x_incidence into the R console to see more information

See how in the plot below (with a Sunday week interval), the first date label is 27 April 2014, the Sunday before the first case on 1 May.

# create weekly incidence object (Sunday weeks)
outbreak <- incidence(central_data$date_onset, interval = "Sunday week")

# plot with scale_x_incidence()
plot(outbreak)+
  scale_x_incidence(outbreak,             # name of incidence object
                    labels_week = FALSE,  # show dates instead of weeks
                    n_breaks = nrow(outbreak)/8) # breaks every 8 weeks from Sunday before first case

Option 2: scale_x_date() and make_breaks()

Add scale_x_date() from ggplot2, but also leverage make_breaks() from incidence:

  • Advantages: Best of both worlds: weekly labels auto-aligned to incidence interval, and you can make detailed adjustments to label format
  • Disadvantages: If you want minor grid-lines aligned to Sunday-week date labels, they are not auto-aligned; see Option 3

Steps:

  1. Create the incidence object
  2. Make a vector of date breaks using make_breaks(), which behaves similarly to scale_x_incidence(). Provide the incidence object name and, optionally, n_breaks as described above.
  3. Add scale_x_date() to the incidence plot and use the following arguments:
    • breaks = provide the breaks vector you created with make_breaks(), by accessing its $breaks element (see example below)
    • date_labels = make fine adjustments to the date label format (e.g. “%d %b”; use “\n” for a new line)
    • date_minor_breaks = sets the frequency of minor gridlines, e.g. “weeks”. If using Sunday weeks and you want aligned minor gridlines, see Option 3.

Note how in the example below, the incidence object interval is Monday weeks, and the first date label is 28 April, the Monday before the first case reported 1 May.

# Break modification using scale_x_date() and make_breaks()
###########################################################
# make incidence object
outbreak <- incidence(central_data$date_onset, interval = "Monday week")

# make breaks
my_labels <-  make_breaks(outbreak, n_breaks = nrow(outbreak)/6) # breaks every 6 weeks

# plot
plot(outbreak)+
  scale_x_date(
    breaks            = my_labels$breaks, # use $breaks on the make_breaks() output
    date_labels       = "%d %b\n%Y",      # detailed adjustment to date label format
    date_minor_breaks = "weeks")          # vertical lines each week (only works for Monday week incidence objects)  

Option 3: Use scale_x_date() only

Add only scale_x_date() from ggplot2 to the incidence plot:

  • Advantages: Complete control over breaks, labels, gridlines, and plot width
  • Disadvantages: More code required, more opportunity to make mistakes

Syntax:

If your incidence intervals are days or Monday weeks, the syntax is straightforward:

  • Provide an interval for date labels to date_breaks = (e.g. “day”, “week”, “2 weeks”, “month”, “year”)
  • Provide an interval for minor vertical grid lines to date_minor_breaks =

# Date break modification using scale_x_date() only
###################################################

# make incidence object
outbreak <- incidence(central_data$date_onset, interval = "Monday week")

# plot
plot(outbreak)+
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # date labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # minor vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")  # date labels format 

If your incidence intervals are Sunday weeks, the code required is more complex; see the Sunday week example below.

  • Provide a sequence of Sunday dates to breaks = and to minor_breaks =
  • Use date_labels = for formatting (see Dates page for tips)
  • Add the argument expand = c(0,0) to start labels at the first incidence bar. Otherwise, the first label will shift depending on your specified label interval.

A Sunday week example

If you want a plot of Sunday weeks with finely-adjusted label formats, the example below produces a weekly epicurve with incidence and adjusts the date labels through ggplot2’s scale_x_date():

# load packages
pacman::p_load(tidyverse,  # for ggplot
               incidence,  # for epicurve
               lubridate)  # for floor_date() and ceiling_date()

# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset,
                              interval = "Sunday week") # equivalent to "MMWRweek" (see US CDC)

# plot() the incidence object
plot(central_outbreak)+                  
  
  ### ggplot() commands added to the plot
  
  # Date-axis 
  scale_x_date(
    
    # remove excess x-axis space below and after case bars
    expand = c(0,0),                 
    
    # date labels every 3 weeks, from Sunday before first case to Sunday after last case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    
    # grid-lines every week, from Sunday before first case to Sunday after last case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    # date labels format
    date_labels = "%d\n%b\n'%y")+       
  
  # Y-axis
  scale_y_continuous(
    expand = c(0,0))+                  # remove excess space under x-axis
  
  # Aesthetic themes
  theme_minimal()+                    # simplify background
  
  theme(
    axis.title = element_text(size = 12, face = "bold"),       # axis titles formatting
    plot.caption = element_text(face = "italic", hjust = 0))+  # caption formatting, left-aligned
  
  # Plot labels
  labs(x = "Week of symptom onset (Sunday weeks)", 
       y = "Weekly case incidence", 
       title = "Weekly case incidence at Central Hospital (Sunday weeks)",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Facets/small multiples

To facet the plot by a variable (make “small multiples”), you must do this via ggplot2. See the section on facets.
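For example, a minimal sketch using ggplot2’s facet_wrap() (toy data standing in for the linelist; the column names are illustrative):

```r
pacman::p_load(ggplot2)

# toy linelist: onset dates and a hospital column (illustrative only)
toy <- data.frame(
  date_onset = as.Date("2014-05-01") + 0:99,
  hospital   = rep(c("Central Hospital", "Port Hospital"), each = 50))

# one weekly epicurve panel ("small multiple") per hospital
p <- ggplot(toy, aes(x = date_onset)) +
  geom_histogram(binwidth = 7) +     # see the binwidth caution elsewhere on this page
  facet_wrap(~ hospital)

p
```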

Other tips

DANGER: Be cautious with a static y-axis scale (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static limits can cut off your data if the data change!

Note: if using aggregated counts (for example with an epiweek x-axis), your x-axis may not be class Date and may require scale_x_discrete() instead of scale_x_date() - see the section on aggregated data epicurves for more details.
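One defensive pattern for the y-axis warning above is to derive the breaks from the data instead of hard-coding them. A base-R sketch with synthetic weekly counts:

```r
# synthetic weekly case counts (illustrative only)
weekly_counts <- c(2, 8, 15, 23, 11, 4)

# breaks from 0 up to the observed maximum, rounded up to the nearest 5
y_breaks <- seq(0, 5 * ceiling(max(weekly_counts) / 5), by = 5)
y_breaks
## [1]  0  5 10 15 20 25
```

The resulting vector can then be supplied to scale_y_continuous(breaks = y_breaks), so the axis always covers the data.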

Epicurves with ggplot2

Using the ggplot2 package alone to create an epicurve offers more customizable plots, but also involves more code and more potential for error:

Unlike with the incidence package, you must manually control the aggregation of the data (into weeks, months, etc.) and the labels on the date axis. If not carefully managed, this can lead to headaches.

These examples use a subset of the linelist dataset - only the cases from Central Hospital.

central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

Examples

To produce an epicurve with ggplot() there are three main elements:

  • A histogram, to aggregate the linelisted cases into “bins” and display bars reflecting the counts per bin
  • Scales for the axes and their associated labels
  • Themes for the plot appearance, including titles, labels, captions, etc.

Simplest examples

Below is perhaps the simplest code to produce daily and weekly epicurves.

# daily 
ggplot(data = central_data, aes(x = date_onset)) +  # x column must be class Date
  geom_histogram(binwidth = 1)+                     # cases binned by 1 day 
  labs(title = "Daily")

# weekly
ggplot(data = central_data, aes(x = date_onset)) +  
  geom_histogram(binwidth = 7)+                     # cases binned each 7 days, beginning from first case (!) 
  labs(title = "Weekly")

CAUTION: Using binwidth = 7 starts the first bin at the first case, which could be any day of the week! To create specific Monday or Sunday weeks, see the guidance below.

Specify bin start dates

To create weekly epicurves where the bins begin on a specific day of the week (e.g. Monday or Sunday), specify the histogram bin breaks = manually (not with binwidth). This can be done by creating a sequence of dates using the seq.Date() function (base R):

This function expects from =, to =, and by = arguments, as shown below.

seq.Date(
  from = as.Date("2015-01-01"),
  to = as.Date("2016-01-01"),
  by = "months")
##  [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [11] "2015-11-01" "2015-12-01" "2016-01-01"

You can start/end the sequence at a specific date, as shown above, or you can write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such dynamic weekly breaks is below:

# Sequence of dates from the Monday before the first case to the Monday after the last case, by week
seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
         to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
         by   = "7 days")
##  [1] "2014-04-28" "2014-05-05" "2014-05-12" "2014-05-19" "2014-05-26" "2014-06-02" "2014-06-09" "2014-06-16" "2014-06-23" "2014-06-30"
## [11] "2014-07-07" "2014-07-14" "2014-07-21" "2014-07-28" "2014-08-04" "2014-08-11" "2014-08-18" "2014-08-25" "2014-09-01" "2014-09-08"
## [21] "2014-09-15" "2014-09-22" "2014-09-29" "2014-10-06" "2014-10-13" "2014-10-20" "2014-10-27" "2014-11-03" "2014-11-10" "2014-11-17"
## [31] "2014-11-24" "2014-12-01" "2014-12-08" "2014-12-15" "2014-12-22" "2014-12-29" "2015-01-05" "2015-01-12" "2015-01-19" "2015-01-26"
## [41] "2015-02-02" "2015-02-09" "2015-02-16" "2015-02-23" "2015-03-02" "2015-03-09" "2015-03-16" "2015-03-23" "2015-03-30" "2015-04-06"
## [51] "2015-04-13" "2015-04-20" "2015-04-27" "2015-05-04"

Let’s unpack the rather daunting code above:

  • The “from” value (earliest date of the sequence) is created as follows: the minimum date value (min() with na.rm=TRUE) in the column date_onset is fed to floor_date() from the lubridate package. floor_date() uses the specified arguments to return the start date of that “week”, given that the start of each week is a Monday (week_start = 1).
  • Likewise, the “to” value (end date of the sequence) is created using the inverse function ceiling_date() to return the Monday after the last case.
  • The “by” argument of seq.Date() can be set to any number of days, weeks, or months.

These sequences of dates can be used to create histogram bin breaks, but also the breaks for the date labels, which may be independent from the bins. Read more about the date labels in later sections.

Below are detailed example codes to produce weekly epicurves for Monday weeks and for Sunday weeks.

Monday weeks example

Of note:

  • The break points of the histogram bins are specified manually to begin the Monday (week_start = 1) before the earliest case and to end the Monday after the last case (see explanation above).
  • The breaks for date labels on the x-axis are easy and use date_breaks = within scale_x_date() because we want Monday weeks. For Sunday weeks, see next example.
  • Minor vertical gridlines between date labels are made using date_minor_breaks = within scale_x_date(). Again, Sunday week alignment would use a slightly different method.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the axes, which also ensures the date labels begin at the first bar.
  • Color and fill of the bars are defined in geom_histogram()

# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    
    # bars
    color = "darkblue",   # color of lines around bars
    fill = "lightblue",   # color of fill within bars
  
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days")    # bins are 7-days
  )+ 
    
  
  # x-axis labels
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+             # remove excess y-axis space below 0
  
  # aesthetic themes
  theme_minimal()+                # simplify plot background
  
  theme(
    plot.caption = element_text(face = "italic", # caption on left side in italics
                                hjust = 0), 
    axis.title = element_text(face = "bold"))+   # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases (Monday weeks)",
    subtitle = "Note alignment of bars, vertical lines, and axis labels on Mondays",
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Sunday weeks example

The below code achieves the same epicurve but uses Sunday weeks. Of note:

  • The break points of the histogram bins are specified manually to begin the Sunday (week_start = 7) before the earliest case and to end the Sunday after the last case (see explanation above).
  • Because the bins are not Monday weeks, the breaks for date labels on the x-axis and the vertical gridlines must be manually specified vectors of dates, as generated by seq.Date(). These date break vectors are given to breaks = and minor_breaks = within scale_x_date(). Unlike for Monday weeks, you cannot use the scale_x_date() arguments date_breaks and date_minor_breaks.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the axes, which also ensures the labels begin at the first bar.
  • Color and fill are defined in geom_histogram()

# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  
  # Histogram -
  geom_histogram(
    
    # bars
    color = "darkblue",   # color of lines around bars
    fill = "lightblue",   # color of fill within bars
    
    # manually specify bin break points: starts the Sunday before first case, ends Sunday after last case
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
      by   = "7 days")    # bins are 7-days
    )+ 
    
  
  # The labels on the x-axis
  scale_x_date(
    expand = c(0,0),
    
    # manually specify label breaks: starts the Sunday before first case, end Sunday after last case
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
      by   = "3 weeks"),
    
    # manually specify vertical gridline breaks: starts the Sunday before first case, end Sunday after last case
    minor_breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
      by   = "7 days"),
   
    # date label format
    date_labels = "%d\n%b\n'%y")+         # day, above month abbrev., above 2-digit year
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+                     # removes excess y-axis space below 0
  
  # aesthetic themes
  theme_minimal()+                               # a set of themes to simplify plot
  
  theme(
    plot.caption = element_text(face = "italic", # caption on left side in italics
                                hjust = 0), 
    axis.title = element_text(face = "bold"))+   # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases (Sunday weeks)",
    subtitle = "Note alignment of bars, vertical lines, and axis labels on Sundays",
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Group/color by value

The bars can be colored by group and “stacked”. To designate the column containing values for the groups, make the following changes. See the ggplot tips page for details.

  • Add the aesthetics argument aes() within geom_histogram()
  • Within aes(), provide the grouping column name to group = and fill = (no quotes needed).
  • Remove any fill = argument outside of aes(), as it will override the one inside
  • Arguments inside aes() will apply by group, whereas any outside aes() will apply to all bars (e.g. you may want color = outside, so each bar has the same border color)

geom_histogram(aes(group = hospital, fill = hospital), color = "black")

Here it is applied in practice:

ggplot(data = plot_data) + 
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    aes(x = date_onset,
        group = hospital,
        fill = hospital),
    
    # bin breaks defined for Monday weeks
    breaks = seq.Date(
      from = as.Date(floor_date(min(plot_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(plot_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    
    # Color around bars
    color = "black")

Adjust colors

  • To manually set the fill for each group, use scale_fill_manual() (note: scale_color_manual() is different!).
    • Use the values = argument to apply a vector of colors.
    • Use na.value = to specify a color for NA values.
    • To change the text of legend labels you can use the labels = argument in scale_fill_manual(), but it is dangerously easy to accidentally give colors incorrect legend text! Instead, it is recommended to change legend text by converting the grouping column to class Factor and adjusting its labels as described in the Factors page and briefly below.
  • To adjust the colors via a pre-defined color scale, see the page on ggplot tips.
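For instance, a manual fill scale with named values and an explicit color for missing values might look like this sketch (the level names and colors are illustrative, not from the example below):

```r
# hypothetical example: name the values to avoid assigning colors in the wrong order
scale_fill_manual(
  values   = c("Port Hospital"     = "orange",
               "Military Hospital" = "purple"),
  na.value = "grey")   # color applied to NA values
```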
ggplot(data = plot_data)+ 
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    aes(x = date_onset,
        group = hospital,
        fill = hospital),
    
    # bin breaks defined for Monday weeks
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    
    # Color around bars
    color = "black")+
  
  # manual specification of colors
  scale_fill_manual(
    values = c("grey", "black", "orange", "purple")) # specify fill colors ("values") - attention to order!

Adjust level order

Stacking order, and the labels for each group in the legend, is best adjusted by classifying the group column as class Factor. You can then designate the levels and their labels, and the order (which is reflected in stack order). See the page on Factors or ggplot tips for details.

Before making the plot, convert the grouping column to class Factor using as_factor() from the forcats package. Then you can make other adjustments to the levels, as detailed in the page on Factors.

# load forcats package for working with factors
pacman::p_load(forcats)

# Convert hospital column to factor and adjust levels
plot_data <- linelist %>% 
  mutate(hospital = as_factor(hospital)) %>%                      # define as factor
  mutate(hospital = fct_explicit_na(hospital, "Missing")) %>%     # convert NA to "Missing" 
  mutate(hospital = fct_lump(hospital, n = 3)) %>%                # Keep 3 most frequent hospitals, with remaining combined into "Other" 
  mutate(hospital = fct_relevel(hospital, c("Missing", "Other"))) # Set "Missing" and "Other" as top levels to appear on epicurve top

levels(plot_data$hospital)
## [1] "Missing"           "Other"             "Port Hospital"     "Military Hospital"

In the plot below, the only differences from the previous one are that the hospital column has been consolidated as above, and guides() is used to reverse the legend order so that “Missing” appears at the bottom of the legend.

ggplot(plot_data) + 
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    aes(x = date_onset,
        group = hospital,
        fill = hospital),
    
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    
    color = "black")+  
    
  # x-axis labels
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+             # remove excess y-axis space below 0
  
  # manual specification of colors
  scale_fill_manual(
    values = c("grey", "black", "orange", "purple"))+ # specify fill colors ("values") - attention to order!
  
  guides(fill = guide_legend(reverse = TRUE))+  # reverse order of legend only
  
  # aesthetic themes
  theme_minimal()+                # simplify plot background
  
  theme(
    plot.caption = element_text(face = "italic", # caption on left side in italics
                                hjust = 0), 
    axis.title = element_text(face = "bold"))+   # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases by hospital",
    subtitle = "3 most frequent values shown individually, plus 'Other'",
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    fill     = "Hospital")   # title of legend

Adjust legend

Read more about legends in the ggplot tips page. Here are a few highlights:

  • labs(fill = "Legend title") to edit the legend title
  • theme(legend.title = element_blank()) to have no title
  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • guides(fill = guide_legend(reverse = TRUE)) to reverse order of the legend
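Combined, these adjustments might look like this sketch, appended to an existing epicurve (here `my_plot` is a hypothetical stand-in for any ggplot object):

```r
# hypothetical snippet: re-title, re-position, and reverse the legend
my_plot +
  labs(fill = "Hospital") +                     # legend title
  theme(legend.position  = "bottom",            # legend below the plot
        legend.direction = "horizontal") +      # entries side-by-side
  guides(fill = guide_legend(reverse = TRUE))   # reverse entry order
```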

Bars side-by-side

Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge" (place this outside of aes()).

If there are more than two groups, side-by-side bars can become difficult to read. Consider instead using a faceted plot (small multiples). To improve readability in this example, missing gender values could be removed.
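As a minimal sketch of removing the missing values first (assuming dplyr is loaded, as elsewhere in this handbook):

```r
# keep only rows with a recorded gender before plotting side-by-side bars
plot_data_complete <- central_data %>% 
  dplyr::filter(!is.na(gender))
```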

ggplot(central_data)+ 
    geom_histogram(
        aes(
          x = date_onset,
          group = gender,   # values grouped and colored by gender
          fill = gender),
        
        # bins start the Monday before first case, end Monday after last case
        breaks = seq.Date(
          from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
          to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
          by   = "7 days"), # bins are 7-days
        
        color = "black",                       # bar edge color
        position = "dodge")+                   # side-by-side bars
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+             # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("brown", "orange"),  # specify fill colors ("values") - attention to order!
                    na.value = "grey" )+     

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported")

Axis limits

You can set maximum and minimum date values using limits = c() within scale_x_date(). For example:

scale_x_date(limits = c(as.Date("2014-04-01"), NA)) # sets a minimum date but leaves the maximum open.  

Likewise, if you want the x-axis to extend to a specific date (e.g. the current date), even if no new cases have been reported, you can use:

scale_x_date(limits = c(NA, Sys.Date())) # ensures date axis will extend until current date  

CAUTION: Caution using limits! They remove all data outside the limits, which can affect the y-axis max/min, modeling, and other statistics. Strongly consider instead adding coord_cartesian(xlim = c(), ylim = c()) to your plot, which acts as a “zoom” without removing data.
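For example, to zoom without dropping underlying data (a sketch; `my_plot` and the dates are illustrative):

```r
# restrict the visible date range without removing data from the plot
my_plot +
  coord_cartesian(xlim = c(as.Date("2014-04-01"), as.Date("2015-07-01")))
```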

DANGER: Be cautious when setting static y-axis breaks or limits (e.g. 0 to 30 by 5: seq(0, 30, 5)). Such hard-coded numbers can cut your plot short if the data change to exceed the limit!
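One way to keep breaks dynamic is to let the scales package compute them from the data at plot time (a sketch):

```r
# breaks chosen from the data when the plot is drawn, not hard-coded
scale_y_continuous(breaks = scales::breaks_pretty(n = 5))
```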

Date-axis labels/gridlines

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars, but visually it can be important to align bins, date labels, and vertical grid lines.

To modify the date labels and grid lines, use scale_x_date() in one of these ways:

  • If your histogram bins are days, Monday weeks, months, or years:
    • Use date_breaks = to specify label frequency (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)
    • Use date_minor_breaks = to specify frequency of minor vertical gridlines between date labels
    • Add expand = c(0,0) to begin the labels at the first bar (otherwise, first label will shift forward depending on specified frequency)
    • Use date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)
  • If your histogram bins are Sunday weeks:
    • Use breaks = and minor_breaks = by providing a sequence of date breaks for each
    • You can still use date_labels = and expand for formatting as described above

Some notes:

  • See the opening ggplot section for instructions on how to create a sequence of dates using seq.Date().
  • If using aggregated counts (for example an “epiweek” x-axis) your x-axis may not be class Date, and you may need scale_x_discrete() instead of scale_x_date() - see the ggplot tips page for more details.
  • See this page or the Working with dates page for tips on creating date labels.
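For reference, a Sunday-week sequence of breaks can be built like this (a sketch assuming lubridate is loaded and the data contain a date_onset column, as in the examples on this page):

```r
# weekly Sunday breaks spanning the data (week_start = 7 means weeks begin Sunday)
sunday_breaks <- seq.Date(
  from = floor_date(min(central_data$date_onset, na.rm = TRUE),   "week", week_start = 7),
  to   = ceiling_date(max(central_data$date_onset, na.rm = TRUE), "week", week_start = 7),
  by   = "week")
```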

Demonstrations

Below is a demonstration of plots where the bins and the plot labels/grid lines are aligned and not aligned:

# 7-day binwidth defaults
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
  geom_histogram(
    binwidth = 7,                       # 7 days per bin (! starts at first case!)
    color = "darkblue",                 # color of lines around bars
    fill = "lightblue") +               # color of bar fill
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\ndefault axis labels/ticks not aligned")


# 7-day bins + Monday labels
#############################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,                 # 7-day bins with start at first case
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),               # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",       # Monday every 3 weeks
    date_minor_breaks = "week",    # Monday weeks
    date_labels = "%d\n%b\n'%y")+  # label format
  
  scale_y_continuous(
    expand = c(0,0))+              # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nDate labels and gridlines on Mondays")



# 7-day bins + Months
#####################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),                 # remove excess x-axis space below and after case bars
    date_breaks = "months",          # 1st of month
    date_minor_breaks = "week",      # Monday weeks
    date_labels = "%d\n%b\n'%y")+    # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nGridlines at 1st of each month (with labels) and weekly on Mondays\nLabels on 1st of each month")


# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  labs(
    title = "ALIGNED Mondays",
    subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")


# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),
    # date label breaks set to every 3 weeks beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    # gridlines set to weekly beginning Sunday before first case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    date_labels = "%d\n%b\n'%y")+  # label format
  
  labs(title = "ALIGNED Sundays",
       subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")

Faceting/small-multiples

As with other ggplots, you can create facetted plots (“small multiples”). As explained in the ggplot tips page of this handbook, you can use either:

  • facet_wrap()
  • facet_grid()

facet_wrap()

For epicurves, facet_wrap() is typically easiest as it is likely that you only need to facet on one column. The general syntax is facet_wrap(rows ~ cols), where to the left of the tilde (~) is the name of a column to be spread across the “rows” of the facetted plot, and to the right of the tilde is the name of a column to be spread across the “columns” of the facetted plot.

Most simply, just use one column name, to the right of the tilde: facet_wrap(~age_cat).

Free axes
You will need to decide whether the scales of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). Do this with the scales = argument within facet_wrap() by specifying “free_x” or “free_y”, or “free”.

Number of cols and rows of facets
This can be specified with ncol = and nrow = within facet_wrap().
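For instance, combining these arguments (a sketch):

```r
# 4 columns of facets, with each facet's y-axis scaled to its own data
facet_wrap(~ age_cat, ncol = 4, scales = "free_y")
```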

Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.
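For instance, using forcats (a sketch; the level name is illustrative):

```r
# make the "70+" panel appear first by reordering factor levels before plotting
plot_data <- central_data %>% 
  dplyr::mutate(age_cat = forcats::fct_relevel(age_cat, "70+"))
```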

Aesthetics
Font size and face, strip background color, etc. can be modified through theme() with arguments like:

  • strip.text = element_text() (size, colour, face, angle…)
  • strip.background = element_rect() (e.g. element_rect(fill = "red"))

Note that the position of the strip (“bottom”, “top”, “left”, or “right”) is set with the strip.position = argument within facet_wrap() itself, not within theme().

Strip labels
Labels of the facet plots can be modified through the “labels” of the column as a factor, or by the use of a “labeller”.

Make a labeller like this, using the function as_labeller() from ggplot2. Then provide the labeller to the labeller = argument of facet_wrap() as shown below.

my_labels <- as_labeller(c(
     "0-4"   = "Ages 0-4",
     "5-9"   = "Ages 5-9",
     "10-14" = "Ages 10-14",
     "15-19" = "Ages 15-19",
     "20-29" = "Ages 20-29",
     "30-49" = "Ages 30-49",
     "50-69" = "Ages 50-69",
     "70+"   = "Over age 70"))

An example facetted plot - facetted by column age_cat.

# make plot
###########
ggplot(central_data) + 
  
  geom_histogram(
        aes(
          x = date_onset,
          group = age_cat,
          fill = age_cat),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        
        # histogram breaks set to 7 days beginning Monday before first case
        breaks = seq.Date(
          from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
          to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
          by   = "7 days"))+
                      
    
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_rect(fill = "grey"))+               # axis titles in bold
  
  # create facets
  facet_wrap(~age_cat,
             ncol = 4,
             strip.position = "top",
             labeller = my_labels)+             
  
  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

See this link for more information on labellers.

Total epidemic in facet background

To show the total epidemic in the background of each facet, add a separate geom_histogram() command before the current one. Specify that the data used in this histogram is your data without the column used for faceting (use select()). Then, specify a color like “grey” and a degree of transparency to make it appear in the background.

geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+

Note that the y-axis maximum is now based on the height of the entire epidemic.

ggplot(central_data, aes(x = date_onset)) + 
  
  # for background shadow of whole outbreak
  geom_histogram(
    data = select(central_data, -age_cat),
    color = "grey",
    alpha = 0.5)+

  # actual epicurves by group
  geom_histogram(
    aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
    color = "black",                       # arguments outside aes() apply to all data
    
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = seq.Date(
          from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
          to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
          by   = "7 days"))+                  
  
  # Labels on x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space below and after case bars
    date_breaks       = "2 months",     # labels appear every 2 months
    date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
    date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+  # removes excess y-axis space below 0
  
  # aesthetic themes
  theme_minimal()+                                           # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"),
    legend.position = "bottom",
    strip.text = element_text(face = "bold", size = 10),
    strip.background = element_rect(fill = "white"))+        # axis titles in bold
  
  # create facets
  facet_wrap(
    ~age_cat,                          # each plot is one value of age_cat
    ncol = 4,                          # number of columns
    strip.position = "top",            # position of the facet title/strip
    labeller = my_labels)+             # labeller defines above
  
  # labels
  labs(
    title    = "Weekly incidence of cases, by age category",
    subtitle = "Subtitle",
    fill     = "Age category",                                      # provide new title for legend
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

One facet with ALL data

If you want one facet panel that contains all the data, duplicate the entire dataset so that the number of rows doubles, and add a new column (in this case called facet). In this column, the duplicated rows hold the value “all”, while the original rows hold the value of the faceting column - thus you have created a new column to facet on, in which one of the unique values contains all the data from the original dataset. The “helper” function CreateAllFacet() below assists with this:

# Define helper function
CreateAllFacet <- function(df, col){
     df$facet <- df[[col]]
     temp <- df
     temp$facet <- "all"
     merged <- rbind(temp, df)
     
     # ensure the facet value is a factor
     merged[[col]] <- as.factor(merged[[col]])
     
     return(merged)
}

Now apply the helper function to the dataset, on column age_cat:

# Create dataset that is duplicated and with new column "facet" to show "all" age categories as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
  mutate(
    facet = factor(facet,
                  levels = c("all", "0-4", "5-9",
                             "10-14", "15-19", "20-29",
                             "30-49", "50-69", "70+")))

# check
table(central_data2$facet, useNA = "always")
## 
##   all   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##   454    74    82    63    55    99    59    10     3     9

Notable changes to the ggplot() command are:

  • The data used is now central_data2 (double the rows, with new column “facet”)
  • Labeller will need to be updated, if used
  • Optional: to achieve vertically stacked facets, move the facet column to the rows side of the formula and replace the right side with “.” (facet_wrap(facet ~ .)), and set ncol = 1. You may also need to adjust the width and height of the saved png plot image (see ggsave() in ggplot tips).
ggplot(central_data2, aes(x = date_onset)) + 
  
  # actual epicurves by group
  geom_histogram(
        aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        
        # histogram breaks set to 7 days beginning Monday before first case
        breaks = seq.Date(
          from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
          to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
          by   = "7 days"))+
                      
  # Labels on x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom")+               
  
  # create facets
  facet_wrap(facet~. ,                            # each plot is one value of facet
             ncol = 1)+            

  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Moving averages

See the page on Moving averages for a detailed description and several options. Below is one option for calculating moving averages with the slider package. To plot a pre-calculated moving average:

  • Aggregate the data as necessary (daily, weekly, etc.)
  • Calculate the moving average
  • Add the moving average to the ggplot (e.g. with geom_line())

Using slider

In this approach, the moving average is calculated in the dataset prior to plotting:

  • Within mutate(), a new column is created to hold the average. slide_index() from slider package is used as shown below.
  • In the ggplot(), a geom_line() is added after the histogram, reflecting the moving average.

See the helpful online vignette for the slider package

# load package
pacman::p_load(slider)  # slider used to calculate rolling averages

# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>% 
  ## count cases by date
  count(date_onset,
        name = "new_cases") %>%   # name of new column
  filter(!is.na(date_onset)) %>%  # remove cases with missing date_onset
  
  ## calculate the average number of cases in the preceding 7 days
  mutate(
    avg_7day = slider::slide_index(    # create new column
      new_cases,                       # calculate based on value in new_cases column
      .i = date_onset,                 # index is date_onset col, so non-present dates are included in window 
      .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
      .before = 6,                     # window is the day and 6-days before
      .complete = FALSE),              # must be FALSE for unlist() to work in next step
    avg_7day = unlist(avg_7day))


# plot
######
ggplot(data = ll_counts_7day, aes(x = date_onset)) +
    geom_histogram(aes(y = new_cases),
                   fill="#92a8d1",
                   stat = "identity",
                   position = "stack",
                   colour = "#92a8d1")+ 
    geom_line(aes(y = avg_7day, lty = "7-day \nrolling avg"),
              color="red",
              size = 1) + 
    scale_x_date(date_breaks = "1 month",
                 date_labels = '%d/%m',
                 expand = c(0,0)) +
    scale_y_continuous(expand = c(0,0),
                       limits = c(0, NA)) + 
    labs(x="",
         y ="Number of confirmed cases",
         fill = "Legend")+ 
    theme_minimal()+
    theme(legend.title = element_blank())  # removes title of legend

Tentative data

The most recent data shown in epicurves should often be marked as tentative, or subject to reporting delays. This can be done by adding a vertical line and/or a shaded rectangle over a specified number of days. Here are two options:

  1. Use annotate():
    • Pros: Transparency of rectangle is easy to adjust. Cons: Items will not appear in legend.
    • For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
    • For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
  2. Use geom_segment() and geom_rect():
    • Pros: Items can easily appear in legend. Cons: Difficult to achieve semi-transparency of rectangle.
    • Provide the same x/y arguments as noted above for annotate()

CAUTION: While you can use geom_rect() to draw a rectangle, adjusting the transparency (alpha) does not work well in a linelist context, because this geom overlays one rectangle for each observation/row! Try a very low alpha (e.g. 0.01), or use annotate(geom = "rect") as shown.

Using annotate()

  • Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
  • Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks
  • Here is an annotate() online example
ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "1 month",           # 1st of month
    date_minor_breaks = "1 month",     # 1st of month
    date_labels = "%b\n'%y")+          # label format
  
  # labels and theme
  labs(title = "Using annotate()\nRectangle and line showing that data from last 21-days are tentative",
    x = "Week of symptom onset",
    y = "Weekly case incidence")+ 
  theme_minimal()+
  
  # add semi-transparent red rectangle to tentative data
  annotate("rect",
           xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
           xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
           ymin  = 0,
           ymax  = Inf,
           alpha = 0.2,          # alpha easy and intuitive to adjust using annotate()
           fill  = "red")+
  
  # add black vertical line on top of other layers
  annotate("segment",
           x     = max(central_data$date_onset, na.rm = T) - 21, # 21 days before last data
           xend  = max(central_data$date_onset, na.rm = T) - 21, 
           y     = 0,         # line begins at y = 0
           yend  = Inf,       # line to top of plot
           size  = 2,         # line size
           color = "black",
           lty   = "solid")+   # linetype e.g. "solid", "dashed"

  # add text in rectangle
  annotate("text",
           x = max(central_data$date_onset, na.rm = T) - 15,
           y = 15,
           label = "Subject to reporting delays",
           angle = 90)

The same black vertical line can be achieved with geom_vline(), as shown below, but you lose the ability to control the line's height:

geom_vline(xintercept = max(central_data$date_onset, na.rm = T) - 21,
           size = 2,
           color = "black")

Using geom_segment() and geom_rect()

In this alternative method, the red color is explained in the legend.

ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  # labels and theme
  labs(title = "Using geom_segment() and geom_rect()\nRectangle and line showing that data from the last 21 days are tentative",
    subtitle = "")+ 
  theme_minimal()+
  
  # make rectangle covering last 21 days
  geom_rect(aes(
              xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
              xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
              ymin  = 0,
              ymax  = Inf,
              color = "Reporting delays\npossible"),    # sets label for legend (note: is within aes())
              alpha = .002,                             # !!! Difficult to adjust transparency with this option
              fill  = "red")+
  
  # make vertical line
  geom_segment(aes(x = max(central_data$date_onset, na.rm = T) - 21,
                   xend = max(central_data$date_onset, na.rm = T) - 21,
                   y = 0,
                   yend = Inf),
               color = "black",
               lty = "solid",
               size = 2)+
  theme(legend.title = element_blank())                 # remove title of legend

Multi-level date labels

If you want multi-level date labels (e.g. month and year) without duplicating the lower label levels, consider one of the approaches below:

Remember - you can use tools like \n within the date_labels or labels arguments to put parts of each label on a new line below. However, the code below helps you place years or months (for example) on a lower line, appearing only once.

A few notes on the code below:

  • Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details.
  • A line is used instead of a histogram, as the faceting approach below does not work well with histograms.

Aggregate to weekly counts

# Create dataset of case counts by week
#######################################
central_weekly <- linelist %>%
  filter(hospital == "Central Hospital") %>%           # filter linelist
  mutate(week = lubridate::floor_date(date_onset, unit = "weeks")) %>%  
  count(week, .drop=F) %>%                             # summarize weekly case counts
  filter(!is.na(week)) %>%                             # remove cases with missing onset_date
  complete(week = seq.Date(from = min(week),           # fill-in all weeks with no cases reported
                           to   = max(week),
                           by   = "week"))

Make plots

# plot with box border on year
##############################
ggplot(central_weekly) +
  geom_line(aes(x = week, y = n),    # make line, specify x and y
            stat = "identity") +             # because line height is count number
  scale_x_date(date_labels="%b",             # date label format show month 
               date_breaks="month",          # date labels on 1st of each month
               expand=c(0,0)) +              # remove excess space
  facet_grid(~lubridate::year(week), # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",                # x-axes adapt to data range (not "fixed")
             switch="x") +                   # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",         # facet labels placement
        strip.background = element_rect(fill = NA, # facet labels no fill grey border
                                        colour = "grey50"),
        panel.spacing = unit(0, "cm"))+      # no space between facet panels
  labs(title = "Nested year labels, grey label border")

# plot with no box border on year
#################################
ggplot(central_weekly,
       aes(x = week, y = n)) +              # establish x and y for entire plot
  geom_line(stat = "identity",              # make line, line height is count number
            color = "#69b3a2") +            # line color
  geom_point(size=1, color="#69b3a2") +     # make points at the weekly data points
  geom_area(fill = "#69b3a2",               # fill area below line
            alpha = 0.4)+                   # fill transparency
  scale_x_date(date_labels="%b",            # date label format show month 
               date_breaks="month",         # date labels on 1st of each month
               expand=c(0,0)) +             # remove excess space
  facet_grid(~lubridate::year(week),   # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",               # x-axes adapt to data range (not "fixed")
             switch="x") +                  # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",                     # facet label placement
          strip.background = element_blank(),            # no facet lable background
          panel.grid.minor.x = element_blank(),          
          panel.border = element_rect(colour="grey40"),  # grey border to facet PANEL
          panel.spacing=unit(0,"cm"))+                   # No space between facet panels
  labs(title = "Nested year labels - points, shaded, no label border")

The above techniques were adapted from this and this post on stackoverflow.com.

Aggregated data

Aggregating linelist data

To learn generally how to group and aggregate data, see the page on Grouping data.

Below, we demonstrate aggregating into days, weeks, and months.

Aggregating linelist into days

To aggregate a linelist into days, use the same approach as for weeks (below), but there is no need to create a new column - count directly on the existing date column (e.g. date_onset).

If plotting a histogram, missing days in the data are not a problem as long as the column is class Date. However, it may be important for other types of plots or tables to have all possible days appear in the data. This is done with tidyr::complete():

# Make dataset of daily case counts
daily_counts <- linelist %>% 
  count(date_onset) %>%                           # count number of rows per unique date
  filter(!is.na(date_onset)) %>%                  # remove aggregation of rows that were missing date_onset
  complete(date_onset = seq.Date(min(date_onset), # ensure all days appear
                                 max(date_onset),
                                 by="day"))  

Aggregating linelist into weeks

Create a new column that is weeks, then use group_by() with summarize() to get weekly case counts.

To aggregate into weeks and show ALL weeks (even ones with no cases), do this:

  1. Create a new ‘week’ column within mutate(), using floor_date() from the lubridate package:
    • use unit = to set the desired time unit, e.g. "week"
    • use week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
  2. Follow with complete() to ensure that all weeks appear - even those with no cases.

For example:

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(
    week = lubridate::floor_date(date_onset,
                                 unit = "week")) %>%  # new column of week of onset
  count(week) %>%                                     # group data by week and count rows per group
  filter(!is.na(week)) %>%                            # remove entries for cases missing date_onset
  complete(week = seq.Date(from = min(week),          # fill-in all weeks with no cases reported
                           to = max(week),
                           by="week")) %>% 
  ungroup()                                           # deactivate grouping

Here are the first 50 rows of the resulting dataframe:

Alternatively, you can use the aweek package’s date2week() function. As shown below, set week_start = to “Sunday”, “Monday”, etc. Set floor_day = TRUE so the output shows weeks (YYYY-Www) rather than individual days. Set factor = TRUE so that all possible weeks are included, even if there are no cases (this replaces the complete() step in the lubridate approach above). You can also use numeric = TRUE if you want only the week number (note this will not distinguish between years).

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(week = aweek::date2week(date_onset,          # new column of week of onset
                                 floor_day = T,       # show as weeks without weekday
                                 factor = TRUE)) %>%  # include all possible weeks
  count(week) %>% 
  ungroup()                                           # deactivate grouping

# Optional: add column of start DATE for each week - e.g. for ggplot() when date x-axis is expected
# note: add this step AFTER the above code, to ensure all weeks are present
weekly_counts <- weekly_counts %>% 
  mutate(week_as_date = aweek::week2date(week, week_start = "Monday")) # output is Monday date of each week

Aggregating linelist into months

To aggregate cases into months, again use floor_date() from the lubridate package, but with the argument unit = "months". This rounds each date down to the 1st of its month. The output will be class Date.

Note that in the complete() step we also use by = "months".

# Make dataset of monthly case counts
monthly_counts <- linelist %>% 
  mutate(month = lubridate::floor_date(date_onset, unit = "months")) %>%   # new column, 1st of month of onset
  count(month) %>% 
  filter(!is.na(month)) %>% 
  complete(month = seq.Date(min(month),     # fill-in all months with no cases reported
                            max(month),
                            by="month"))    

Plotting aggregated count data

Often, instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can still make an epicurve with ggplot(), but the code will be slightly different. The incidence package does not currently support plotting of aggregated count data.

This section will utilize the count_data dataset that was imported earlier, in the data preparation section. It is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.

As before, we must ensure date variables are correctly classified.

# Check that the date variable is class Date
class(count_data$date_hospitalisation)
## [1] "Date"

Plotting daily counts

We can plot a daily epicurve from these daily counts. Here are the differences:

  • Specify y = as the counts column within the primary aesthetics aes()
  • Use stat = "identity" within geom_histogram() to indicate that bar heights should come from the y = column in aes(), rather than from counting rows
ggplot(data = count_data, aes(x = as.Date(date_hospitalisation), y = n_cases))+
     geom_histogram(stat = "identity")+
     labs(x = "Date of report", 
          y = "Number of cases",
          title = "Daily case incidence, from daily count data")

Plotting weekly counts

To aggregate the daily counts into weekly counts, we use floor_date() from the lubridate package, as described above.

Note that we use group_by() and summarize() in place of count() because we need to sum() case counts instead of just counting the number of rows per group.

# Create weekly dataset with epiweek column
count_data_weekly <- count_data %>%
  mutate(epiweek = lubridate::floor_date(date_hospitalisation, "week")) %>% 
  group_by(hospital, epiweek, .drop=F) %>% 
  summarize(n_cases_weekly = sum(n_cases, na.rm=T))   

The first 50 rows of count_data_weekly are displayed below. You can see that the counts have been aggregated into weeks. Each week is displayed by the first day of the week (Monday by default).

You can also specify the factor level order of hospital (optional):

count_data_weekly <- count_data_weekly %>% 
  mutate(hospital = factor(hospital),
         hospital = fct_relevel(hospital,
                                c("Missing", "Port Hospital",
                                  "Military Hospital", "Central Hospital",
                                  "St. Mark's Maternity Hospital (SMMH)",
                                  "Other")))

Now plot by epiweek. Remember stat = "identity" when making the histogram.

ggplot(data = count_data_weekly,
       aes(x = epiweek,
           y = n_cases_weekly,
           group = hospital,
           fill = hospital))+
  
  geom_histogram(stat = "identity")+
     
  # labels for x-axis
  scale_x_date(date_breaks = "2 months",      # labels every 2 months 
               date_minor_breaks = "1 month", # gridlines every month
               date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(x = "Week of onset", 
       y = "Weekly case incidence",
       fill = "Hospital",
       title = "Weekly case incidence, from aggregated count data by hospital")

Dual-axis

Although there are fierce debates about the validity of dual axes within the data visualization community, many epi supervisors want to see an epicurve or similar chart with a percentage overlaid on a second axis.

See the handbook page on ggplot tips for details on how to make a second axis.
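As a minimal sketch of the idea (the weekly data frame, its column names, and the percentages below are hypothetical), one approach is to rescale the percentage onto the count scale and use sec_axis() to back-transform the secondary axis labels:

```r
library(ggplot2)

# Hypothetical weekly counts with a percent-confirmed column
weekly <- data.frame(
  week          = seq.Date(as.Date("2021-01-04"), by = "week", length.out = 8),
  n_cases       = c(5, 12, 20, 33, 28, 18, 9, 4),
  pct_confirmed = c(40, 55, 60, 72, 70, 65, 50, 45))

scale_factor <- max(weekly$n_cases) / 100          # maps 0-100% onto the count scale

dual_plot <- ggplot(weekly, aes(x = week)) +
  geom_col(aes(y = n_cases), fill = "lightblue") +             # bars show counts
  geom_line(aes(y = pct_confirmed * scale_factor),             # line shows rescaled percent
            color = "darkred", size = 1) +
  scale_y_continuous(
    name     = "Weekly cases",
    sec.axis = sec_axis(~ . / scale_factor,                    # back-transform for labels
                        name = "% confirmed"))

dual_plot
```

Note that ggplot2's secondary axis must be a transformation of the primary axis, which is why the percentage series is rescaled before plotting and un-scaled in sec_axis().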

Cumulative Incidence

If beginning with a case linelist, create a new column containing the cumulative number of cases per day in an outbreak using cumsum() from base R:

cumulative_case_counts <- linelist %>% 
  count(date_onset) %>%                # count of rows per day (returned in column "n")   
  mutate(                         
    cumulative_cases = cumsum(n)       # new column of the cumulative number of rows at each date
    )

The first 10 rows are shown below:

This cumulative column can then be plotted against date_onset, using geom_line():

plot_cumulative <- ggplot()+
  geom_line(
    data = cumulative_case_counts,
    aes(x = date_onset, y = cumulative_cases),
    size = 2,
    color = "blue")

plot_cumulative

It can also be overlaid onto the epicurve, with dual-axis using the cowplot method described in the ggplot tips page:

#load package
pacman::p_load(cowplot)

# Make first plot of epicurve histogram
plot_cases <- ggplot()+
  geom_histogram(          
    data = linelist,
    aes(x = date_onset),
    binwidth = 1)+
  labs(
    y = "Daily cases",
    x = "Date of symptom onset"
  )+
  theme_cowplot()

# make second plot of cumulative cases line
plot_cumulative <- ggplot()+
  geom_line(
    data = cumulative_case_counts,
    aes(x = date_onset, y = cumulative_cases),
    size = 2,
    color = "blue")+
  scale_y_continuous(
    position = "right")+
  labs(x = "",
       y = "Cumulative cases")+
  theme_cowplot()+
  theme(
    axis.line.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks = element_blank())

Now use cowplot to overlay the two plots. Attention has been paid to the x-axis alignment, side of the y-axis, and use of theme_cowplot().

aligned_plots <- align_plots(plot_cases, plot_cumulative, align="hv", axis="tblr")
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])

Resources

Links to other online tutorials or resources.

Plot continuous data

For appropriate plotting of continuous data, such as age, clinical measurements, and distance.

Overview

Ggplot2, part of the Tidyverse family, is a fantastic and versatile package for visualising continuous data. As usual, R also has built-in functions, which can be helpful for quick looks at the data.

Visualisations covered here include:

  • Plots for one continuous variable:
    • Histograms, the classic graph to present the distribution of a continuous variable.
    • Box plots (also called box-and-whisker plots), in which the box represents the 25th, 50th, and 75th percentiles of a continuous variable, the lines outside the box represent the tails of the distribution, and dots represent outliers.
    • Violin plots, which are similar to histograms in that they show the distribution of a continuous variable, based on the symmetrical width of the ‘violin’.
    • Jitter plots, which visualise the distribution of a continuous variable by showing all values as dots, rather than collectively as one larger shape. Each dot is ‘jittered’ so that they can all (mostly) be seen, even where two have the same value.
    • Sina plots, which are a cross between jitter and violin plots, where the individual points can be seen but in the symmetrical shape of the distribution (note this requires the ggforce package).
  • Scatter plots for two continuous variables.

Preparation

Preparation includes loading the relevant packages, here ggplot2 and dplyr, and ensuring your data is the correct class and format. For the examples in this section, we use the simulated Ebola linelist, focusing on the continuous variables age, wt_kg (weight in kilos), ct_blood (CT values), and days_onset_hosp (difference between onset date and hospitalisation).

Note: You could load just tidyverse, which includes ggplot2 and dplyr among other packages (stringr, tidyr, for instance).

pacman::p_load(ggplot2,
               dplyr)

linelist <- rio::import(here::here("data", "linelist_cleaned.rds")) %>% #Load the data
  mutate(age = as.numeric(age),
         ct_blood = as.numeric(ct_blood),
         days_onset_hosp = as.numeric(days_onset_hosp),
         wt_kg = as.numeric(wt_kg)) # Converting vars to numeric as examples 

You should have conducted various data checks before this point, including checking the missingness of the data.

Plotting with ggplot2

Code syntax

Ggplot2 has extensive functionality, and the same code syntax can be used for many different plot types.

A basic breakdown of the ggplot code is as follows:

ggplot(data = linelist)+  
  geom_XXXX(aes(x = col1, y = col2),
       fill = "color") 
  • ggplot() starts off the function. You can specify the data and aesthetics (see next point) within the ggplot bracket, unless you are combining different data sources or plot types into one
  • aes() stands for ‘aesthetics’, and is where the columns used for the visualisation are specified. For instance aes(x = col1, y = col2) to specify the data used for the x and y values (where y is the continuous variable in these examples).
  • fill specifies the colour of the boxplot areas. One could also write color to specify outline or point colour.
  • geom_XXX specifies what type of plot. Options include:
    • geom_boxplot() for a boxplot
    • geom_histogram() for a histogram
    • geom_violin() for a violin plot
    • geom_jitter() for a jitter plot
    • geom_point() for a scatter plot
    • geom_sina() for a jitter plot where the width of the jitter is controlled by the density distribution of the data within each class

Note that the aes() bracket can be within the ggplot() bracket or within the specific geom_XXXX() bracket. If you are layering different geoms with different aesthetics, you will need to specify them within each geom_XXXX().

For more see section on ggplot tips. We also walk through further customisation below.
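To illustrate the point about aes() placement, the sketch below (using a tiny hypothetical stand-in for the linelist columns age and wt_kg) shows the same mapping written once in ggplot() and inherited by all layers, versus written per layer - the latter is needed when layers use different aesthetics or data:

```r
library(ggplot2)

# Tiny hypothetical stand-in for the linelist used elsewhere on this page
linelist_demo <- data.frame(
  age   = c(5, 12, 25, 31, 44, 58),
  wt_kg = c(18, 30, 60, 68, 72, 65))

# A) Mapping in ggplot() - inherited by every layer
p_inherited <- ggplot(data = linelist_demo, aes(x = age, y = wt_kg)) +
  geom_point() +
  geom_line()

# B) Mapping per layer - required when layers need different aesthetics
p_per_layer <- ggplot() +
  geom_point(data = linelist_demo, aes(x = age, y = wt_kg)) +
  geom_smooth(data = linelist_demo, aes(x = age, y = wt_kg), method = "lm")
```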

Plotting one continuous variable

Box plots

Below is code for creating box plots, to show the distribution of CT values of Ebola patients for the entire dataset and by subgroup. Note that for the subgroup breakdown, the NA values are removed using dplyr, otherwise ggplot plots the distribution for NA as a separate boxplot.

# A) Simple boxplot of one numeric variable
ggplot(data = linelist, aes(y = ct_blood))+  # only y variable given (no x variable)
  geom_boxplot()+
  labs(title = "A) Simple ggplot2 boxplot")

# B) Box plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = ct_blood,                            # Continous variable
           x = outcome)) +                          # Grouping variable
  geom_boxplot(fill = "gold")+                      # Create the boxplot and specify colour
  labs(title = "B) ggplot2 boxplot by outcome")      

Histograms

Below is code for generating histograms, to show the distribution of CT values of Ebola patients. Within the aes() bracket, you specify which variable you want to see the distribution of. You can supply either the x or the y, which will change the direction of the plot. The y or the x respectively will then show the count, represented by columns referred to as ‘bins’.

# A) Regular histogram
ggplot(data = linelist, aes(x = ct_blood))+  # provide x variable
  geom_histogram()+
  labs(title = "A) Simple ggplot2 histogram")

# B) Histogram with values across y axis
ggplot(data = linelist, aes(y = ct_blood))+  # provide y variable 
  geom_histogram()+
  labs(title = "B) Simple ggplot2 histogram with axes swapped")

In the examples above, R has guessed the most appropriate way to present the data, and issues a message to tell you how many bins (columns) it went with, and to prompt you to customise it yourself:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It used 30 bins, and they look spaced out because some of them have 0 values. This relates to the way the values have been rounded.

To change this, you can specify binwidth (i.e. the range of values each bin counts) or bins (the number of bins) within geom_histogram(). The bins are then evenly spaced between the minimum and maximum values of the data.

# A) Histogram with specified bin number
ggplot(data = linelist, aes(x = ct_blood))+   # Provide x variable
  geom_histogram(bins=10,                     # Add bin number
                 color = "white")+            # Add white outline so bars can easily be distinguished
  labs(title = "A) Ggplot histogram with 10 bins")

# B) Histogram with specified bin width
ggplot(data = linelist, aes(x = ct_blood))+   # Provide x variable 
  geom_histogram(binwidth = 1,                # Each bar includes a CT value range of 1
                 color = "white")+            # Add white outline so bars can easily be distinguished
  labs(title = "B) Ggplot histogram with binwidth of 1")

Rather than counts, you can set y = stat(density) within the aes() bracket to show the density (relative frequency) instead - see plot A below. You can also layer different histograms with different settings (plot B).

# A) Histogram with proportion
ggplot(data = linelist, aes(x = ct_blood,           # provide x variable
                            y = stat(density)))+    # Calculate proportion
  geom_histogram(bins=10,                           # Add bin number
                 color = "white")+ # Add white outline so bars can easily be distinguished
  labs(title = "A) Ggplot histogram showing proportion")

# B) Layered histograms with different bin widths
ggplot(data = linelist, aes(x = ct_blood))+         # provide x variable 
  geom_histogram(binwidth = 2) +                    # Underlying layer has binwidth of 2
  geom_histogram(binwidth = 1,                      # Top layer has binwidth of 1
                 alpha = 0.4,                       # Set top layer to be slightly see through
                 fill = "blue")+ 
  labs(title = "B) Layered ggplot histograms")

Violin, jitter, and sina plots

Below is code for creating violin plots (geom_violin()) and jitter plots (geom_jitter()) to show age distributions. One can specify that the fill or color is also determined by the data, by placing these arguments within the aes() bracket.

# A) Violin plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,                                # Continuous variable
           x = outcome,                            # Grouping variable
           fill = outcome))+                       # fill variable (color of boxes)
  geom_violin()+                                   # create the violin plot
  labs(title = "A) ggplot2 violin plot by outcome")    


# B) Jitter plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,                               # Continuous variable
           x = outcome,                           # Grouping variable
           color = outcome))+ # Color variable
  geom_jitter()+                                  # Create the jitter plot
  labs(title = "B) ggplot2 jitter plot by outcome")     

One can combine the two using the geom_sina option, which is actually part of the ggforce package. This can be easier to visually interpret. A) on the left shows basic layering of both a geom_violin and geom_sina. B) shows slightly more effort put into the appearance of the ggplot (see in-line comments).

pacman::p_load(ggforce)

# A) Sina plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,             # numeric variable
           x = outcome)) +      # group variable
  geom_violin()+                # create the violin plot
  geom_sina()+
  labs(title = "A) ggplot() violin and sina plot by outcome")      


# B) Sina plot by group, with formatting
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,             # numeric variable
           x = outcome)) +      # group variable
  geom_violin(aes(fill = outcome), # fill variable (color of violin background)
              color = "white",  # Plot has white outline rather than default black 
              alpha = 0.2)+     # Alpha value where 0 transparent to 1 opaque
  geom_sina(size=1,             # Change the size of the jitter
            aes(color = outcome))+ # color variable (color of dots)
  scale_fill_manual(values = c("Death" = "#bf5300", 
                        "Recover" = "#11118c")) + # Define colours for death/recover 
                                                  # (but note they will come out a bit transparent)
  scale_color_manual(values = c("Death" = "#bf5300", 
                         "Recover" = "#11118c")) + # Define colours for death/recover
  theme_minimal() +                                # Remove the gray background
  theme(legend.position = "none") +                # Remove unnecessary legend
  labs(title = "B) ggplot() violin and sina plot by outcome, with formatting")      

Plotting one continuous variable within facets

Faceting basics

To examine further subgroups, one can ‘facet’ the graph. This means the plot will be recreated within specified subgroups. One can use:

  • facet_wrap() - this will recreate the sub-graphs and present them alphabetically (typically, unless stated otherwise). You can invoke certain options to determine the look of the facets, e.g. nrow=1 or ncol=1 to control the number of rows or columns that the faceted plots are arranged within. See plot A below.
  • facet_grid() - this is suited to seeing subgroups for particular combinations of discrete variables. See plot B below. nrow and ncol are not relevant, as the subgroups are presented in a grid, with the subgroups always in the x or y axis (see notes in code below)

You can stipulate up to two faceting variables, with a ‘~’ between them. If using only one faceting variable, a ‘.’ is used as a placeholder for the unused second variable - see the code examples.

# A) Histogram of hospitalisation dates faceted by hospital
ggplot(data = linelist %>% 
         filter(hospital != "Missing"),               # filter removes unknown hospital
       aes(x = date_hospitalisation ))+
  geom_histogram(binwidth=7) +                        # Binwidth = 7 days
  labs(title = "A) Ggplot 2 histogram of hospitalisation dates by hospital")+
  facet_wrap(hospital~.,                              # Facet by just hospital
            ncol = 2)                                 # Facet in two columns

# B) Boxplot of age faceted in a grid with two variables, gender and outcome
ggplot(data = linelist %>% 
         filter(!is.na(gender) & !is.na(outcome)),    # filter retains non-missing gender/outcome
       aes(y = age))+
  geom_boxplot()+
  labs(title = "B) Ggplot2 boxplot by gender and outcome")+
  facet_grid(outcome~gender)                          # Outcome is the row, gender is the column

Further faceting options

The scales used when faceting are, by default, consistent across subgroups, which is helpful for comparisons but not always appropriate or optimal.

When using facet_wrap() or facet_grid(), we can add scales = "free_y" (plot A) so that the y-axis of each facet adapts to its own data; the peaks then fill each panel and the shapes are easier to compare. This is particularly useful if the actual counts are small for one of the subcategories and trends are otherwise hard to see. Instead of free_y we can also write free_x to do the same for the x-axis, or free for both axes. Note that in facet_grid(), the y scales will be the same for facets in the same row, and the x scales will be the same for facets in the same column.

When using facet_grid only, we can add space = "free_y" or space = "free_x" so that the actual height or width of the facet is weighted to the values of the figure within. This only works if scales = "free" (y or x) already applies.

# A) Facet hospitalisation date by hospital, free y axis
ggplot(data = linelist %>% filter(hospital != "Missing"), # filter removes unknown hospital
       aes(x = date_hospitalisation ))+
  geom_histogram(binwidth=7) + # Binwidth = 7 days
  labs(title = "A) Histogram with free y axis scales")+
  facet_grid(hospital~., # Facet with hospital as the row 
             scales = "free_y") # Free the y scale of each facet

# B) Facet hospitalisation date by hospital, free y axis and vertical spacing
ggplot(data = linelist %>% filter(hospital != "Missing"), # filter removes unknown hospital
       aes(x = date_hospitalisation ))+
  geom_histogram(binwidth=7) + # Binwidth = 7 days
  labs(title = "B) Histogram with free y axis scales and spacing")+
  facet_grid(hospital~., # Facet with hospital as the row 
             scales = "free_y", # Free the y scale of each facet
             space = "free_y") # Free the vertical spacing of each facet to optimise space

Plotting two continuous variables

Following similar syntax, geom_point() allows one to plot two continuous variables against each other in a scatter plot. This is useful for showing actual values rather than their distributions.

A basic scatter plot of age vs weight is shown in (A). In (B) we again use facet_grid to show the relationship between two continuous variables in the linelist.

# Basic scatter plot of weight and age
ggplot(data = linelist, 
       aes(y = wt_kg, x = age))+
  geom_point() +
  labs(title = "A) Scatter plot of weight and age")

# Scatter plot of weight and age by gender and Ebola outcome
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
       aes(y = wt_kg, x = age))+
  geom_point() +
  labs(title = "B) Scatter plot of weight and age faceted by gender and outcome")+
  facet_grid(gender~outcome) 

Plotting with base graphics

In-built graphics package

Using base graphics can sometimes be quicker than ggplot2, and is helpful for an initial exploratory look.

Plotting one continuous variable

Box plots and histograms

The in-built graphics package comes with the boxplot() and hist() functions, allowing straightforward visualisation of a continuous variable.

# Boxplot
boxplot(linelist$wt_kg,
                  main = "A) Base boxplot") 


# Histogram
hist(linelist$wt_kg,
                  main = "B) Base histogram") 

Further customisation

Subgroups can also be shown, either by a single grouping variable or by crossed groups. Note how in plot B below, outcome and gender are written as outcome*gender, so that boxplots are drawn for all four combinations of the two columns. They are not faceted across different rows and columns as in ggplot2.

We specify linelist as the dataset so we do not need to write age as linelist$age.

# Box plot by subgroup
boxplot(age ~ outcome,
                  data = linelist, 
                  main = "A) Base boxplot by subgroup")

# Box plot by crossed subgroups
boxplot(age ~ outcome*gender,
                  data = linelist, 
                  main = "B) Base boxplot by crossed groups")

Some further options with boxplot() shown below are:

  • Boxplot width proportional to sample size (A)
  • Notched boxplots, where the notch marks the median and an approximate confidence interval around it (B)
  • Horizontal (C)
# Varying width by sample size 
boxplot(linelist$age ~ linelist$outcome,
                  varwidth = TRUE, # width varying by sample size
                  main="A) Proportional boxplot() widths")

                  
# Notched boxplot
boxplot(age ~ outcome,
        data=linelist,
        notch=TRUE,      # notch around median
        main="B) Notched boxplot()",
        col=(c("gold","darkgreen")),
        xlab="Outcome")

# Horizontal
boxplot(age ~ outcome,
        data=linelist,
        horizontal=TRUE,  # flip to horizontal
        col=(c("gold","darkgreen")),
        main="C) Horizontal boxplot()",
        xlab="Outcome")

Plotting two continuous variables

Using base R, we can quickly visualise the relationship between two continuous variables with the plot function.

plot(linelist$age, linelist$wt_kg)

Resources


There is a huge amount of help online, especially with ggplot2.

Plot categorical data

For appropriate plotting of categorical data, e.g. the distribution of sex, symptoms, ethnic group, etc.

Overview


In this section we cover the use of R’s built-in functions and functions from the ggplot2 package to visualise categorical data. Given its additional functionality compared to base R, we recommend ggplot2 for presentation-ready visualisations.

We cover visualising distributions of categorical values, as counts and proportions.

Preparation


Load packages and data

Preparation includes loading the relevant packages, namely ggplot2 for examples covered here. We also load the data.

# Load packages we will be using repeatedly
pacman::p_load(ggplot2, # Package for visualisation
       dplyr,           # Package for data management
       forcats)         # Package for factors

# Load data using rio package
linelist <- rio::import(here::here("data", "linelist_cleaned.rds"))

Process columns for analysis

For the examples in this section, we use the simulated Ebola linelist, focusing on the categorical variables hospital, and outcome. These need to be the correct class and format.

Let’s take a look at the hospital column.

# View class of hospital column - we can see it is a character
class(linelist$hospital)
## [1] "character"
# Look at values held within hospital column
table(linelist$hospital)
## 
##                     Central Hospital                    Military Hospital                              Missing 
##                                  454                                  896                                 1469 
##                                Other                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                  885                                 1762                                  422

We can see the values within are characters, as they are hospital names, and by default they are ordered alphabetically. There are ‘other’ and ‘missing’ values, which we would prefer to be the last subcategories when presenting breakdowns. So we change this column into a factor and re-order it. This is covered in more detail in the ‘factors’ data management section.

# Change hospital to factor variable
linelist <- linelist %>% 
  mutate(hospital = factor(hospital))

# Define the levels of factor with forcats - so other and missing are last
linelist <- linelist %>% 
  mutate(hospital = fct_relevel(hospital, 
                                c("St. Mark's Maternity Hospital (SMMH)", 
                                  "Port Hospital", 
                                  "Central Hospital",
                                  "Military Hospital",
                                  "Other",
                                  "Missing")))

Ensure correct data structure

For displaying frequencies and distributions of categorical variables, you have the option of creating plots based on:

  • The linelist data, with one row per observation, or
  • A summary table based on the linelist, with one row per category. An example is below to show the use of dplyr to create a table of case counts per hospital.

Tables can be created using the table() function for use with built-in graphics. The useNA = "ifany" argument ensures that missing values are included, as table() otherwise automatically excludes them.

#Table method
  outcome_nbar <- table(linelist$outcome, 
                        useNA = "ifany")

  outcome_nbar # View full table
## 
##   Death Recover    <NA> 
##    2582    1983    1323

Or using other data management packages such as dplyr. In this example we add on a percentage column.

#Dplyr method
  outcome_n <- linelist %>% 
    group_by(outcome) %>% 
    count %>% 
    ungroup() %>% # Ungroup so proportion is out of total
    mutate(proportion = n/sum(n)*100) # Calculate percentage
  
  
   outcome_n #View full table
## # A tibble: 3 x 3
##   outcome     n proportion
##   <chr>   <int>      <dbl>
## 1 Death    2582       43.9
## 2 Recover  1983       33.7
## 3 <NA>     1323       22.5

Filter to relevant data

You may consider dropping rows not needed for this analysis. For instance, for the next few examples we want to understand trends amongst persons with a known outcome, so we drop rows with missing outcome column values.

#Drop missing from full linelist
linelist <- linelist %>% 
  filter(!is.na(outcome))

#Drop missing from dplyr table
outcome_n <- outcome_n %>% 
  filter(!is.na(outcome))

Plotting with ggplot2


Code syntax

ggplot2 has extensive functionality, and the same code syntax can be used for many different plot types.

Similar to the plotting continuous data section, basic breakdown of the ggplot code is as follows:

ggplot(data = linelist)+  
  geom_XXXX(aes(x = col1, y = col2),
       fill = "color") 
  • ggplot() starts off the function. You can specify the data and aesthetics (see next point) within the ggplot bracket, unless you are combining different data sources or plot types into one
  • aes() stands for ‘aesthetics’, and is where the columns used for the visualisation are specified. For instance aes(x = col1, y = col2) to specify the data used for the x and y values.
  • fill specifies the colour of bars, or of the subgroups if specified within the aes() bracket.
  • geom_XXX specifies what type of plot. Options include:
    • geom_bar() for a bar chart based on a linelist
    • geom_col() for a bar chart based on a table with values (see preparation section)

Note that the aes() bracket can be within the ggplot() bracket or within the specific geom_XXX bracket. If you are layering different geoms with different aesthetics, you will need to specify them within each geom_XXX.

For more see section on ggplot tips.

Bar charts using raw data

Below is code using geom_bar for creating some simple bar charts to show frequencies of Ebola patient outcomes: A) For all cases, and B) By hospital.

In the aes() bracket, only x needs to be specified - or y if you want the bars presented horizontally. ggplot2 knows that the unspecified y (or x) will be the number of observations falling into each category.

# A) Outcomes in all cases
ggplot(linelist) + 
  geom_bar(aes(x=outcome)) +
  labs(title = "A) Number of recovered and dead Ebola cases")


# B) Outcomes in all cases by hospital
ggplot(linelist) + 
  geom_bar(aes(x=outcome, fill = hospital)) +
  theme(axis.text.x = element_text(angle = 90)) + # Add preference to rotate the x axis text
  labs(title = "B) Number of recovered and dead Ebola cases, by hospital")

Bar charts using processed data

Below is code using geom_col for creating simple bar charts to show the distribution of Ebola patient outcomes. With geom_col, both x and y need to be specified. Here x is the categorical variable along the x axis, and y is the generated proportion column.

# Outcomes in all cases
ggplot(outcome_n) + 
  geom_col(aes(x=outcome, y = proportion)) +
  labs(subtitle = "Proportion of recovered and dead Ebola cases")

To show breakdowns by hospital, an additional table needs to be created for frequencies of the combined categories outcome and hospital.

outcome_n2 <- linelist %>% 
  group_by(hospital, outcome) %>% 
  count() %>% 
  group_by(hospital) %>% # Group so proportions are out of hospital total
  mutate(proportion = n/sum(n)*100)

head(outcome_n2) #Preview data
## # A tibble: 6 x 4
## # Groups:   hospital [3]
##   hospital                             outcome     n proportion
##   <fct>                                <chr>   <int>      <dbl>
## 1 St. Mark's Maternity Hospital (SMMH) Death     199       61.2
## 2 St. Mark's Maternity Hospital (SMMH) Recover   126       38.8
## 3 Port Hospital                        Death     785       57.6
## 4 Port Hospital                        Recover   579       42.4
## 5 Central Hospital                     Death     193       53.9
## 6 Central Hospital                     Recover   165       46.1

We then create the ggplot with some added formatting:

  • Axis flip: Swapped the axis around with coord_flip() so that we can read the hospital names.
  • Columns side-by-side: Added a position = "dodge" argument so that the bars for death and recover are presented side by side rather than stacked. Note stacked bars are the default.
  • Column width: Specified width = 0.5, so the columns are half the full possible width.
  • Column order: Reversed the order of the categories so that ‘Other’ and ‘Missing’ are at the bottom, with scale_x_discrete(limits=rev). Note that we used that rather than scale_y_discrete because hospital is stated in the x argument of aes(), even if visually it appears on the y axis after coord_flip(). Without this, the first factor level would be drawn at the bottom of the flipped plot.
  • Other details: Labels/titles and colours are added with labs() and scale_fill_manual() respectively.
# Outcomes in all cases by hospital
ggplot(outcome_n2) +  
  geom_col(aes(x=hospital, 
               y = proportion, 
               fill = outcome),
           width = 0.5,          # Make bars a bit thinner (out of 1)
           position = "dodge") + # Bars are shown side by side, not stacked
  scale_x_discrete(limits=rev) + # Reverse the order of the categories
  theme_minimal() +              # Minimal theme 
  coord_flip() +
  labs(subtitle = "Percent of recovered and dead Ebola cases, by hospital",
       fill = "Outcome",               # Legend title
       x = "Hospital of admission",    # Label of the hospital axis (vertical after coord_flip)
       y = "Percent of cases")  +      # Label of the proportion axis
  scale_fill_manual(values = c("Death"= "#3B1c8C",
                               "Recover" = "#21908D" )) 

Note that the two proportions sum to 100%, so we may prefer to drop ‘Recover’ and show only the proportion who died. Both are shown here for illustration purposes.
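As a sketch of that alternative - keeping only the ‘Death’ rows, since the two proportions sum to 100% - the following uses a small toy table standing in for outcome_n2 (the hospital names and values here are illustrative, not from the linelist):

```r
library(ggplot2)

# Toy stand-in for outcome_n2: one row per hospital/outcome combination
outcome_n2 <- data.frame(
  hospital   = rep(c("Port", "Central"), each = 2),
  outcome    = rep(c("Death", "Recover"), times = 2),
  proportion = c(57.6, 42.4, 53.9, 46.1)
)

# Keep only the 'Death' rows - 'Recover' is implied as the remainder
deaths <- outcome_n2[outcome_n2$outcome == "Death", ]

p <- ggplot(deaths) +
  geom_col(aes(x = hospital, y = proportion)) +
  coord_flip() +
  labs(y = "Percent of known outcomes that were deaths")
```

This removes the need for a legend entirely, since only one proportion per hospital is displayed.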

Facetting

We can also use faceting to create further mini-graphs, which is detailed with examples in the continuous data visualisation section. Specifically, one can use:

  • facet_wrap() - this will recreate the sub-graphs and present them alphabetically (typically, unless stated otherwise). You can invoke certain options to determine the look of the facets, e.g. nrow=1 or ncol=1 to control the number of rows or columns that the faceted plots are arranged within.
  • facet_grid() - this is suited to seeing subgroups for particular combinations of categorical variables.
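As a minimal illustration of facet_wrap(), here is a sketch using a small toy summary table (the column names and counts are made up for this example):

```r
library(ggplot2)

# Toy summary table: counts of outcomes per hospital
toy <- data.frame(
  outcome  = rep(c("Death", "Recover"), times = 3),
  hospital = rep(c("Port", "Central", "Military"), each = 2),
  n        = c(10, 7, 4, 6, 8, 5)
)

# One mini bar chart per hospital, arranged in a single row with nrow = 1
p <- ggplot(toy) +
  geom_col(aes(x = outcome, y = n)) +
  facet_wrap(~ hospital, nrow = 1)
```

Replacing facet_wrap(~ hospital, nrow = 1) with facet_grid(outcome ~ hospital) would instead arrange facets by the combinations of the two categorical variables.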

Plotting with base graphics

In-built graphics package

Bar charts

To create bar plots in base R, we first build a frequency table using the table() function. This creates an object of class table, which R recognises for plotting. We can create a simple frequency graph showing Ebola case outcomes (A), or add colours to present outcomes by gender (B).

Note that NA values are excluded from these plots by default.

# A) Outcomes in all cases
outcome_nbar <- table(linelist$outcome)
barplot(outcome_nbar, main= "A) Outcomes")

# B) Outcomes in all cases by gender of case
outcome_nbar2 <- table(linelist$outcome, linelist$gender) # The first column is for groupings within a bar, the second is for the separate bars
barplot(outcome_nbar2, legend.text=TRUE, main = "B) Outcomes by gender") # Specify inclusion of legend

Resources


There is a huge amount of help online, especially with ggplot2.

Tables

This section demonstrates how to create publication-ready tables, which can be inserted directly into shareable documents, including R Markdown outputs.

Overview

We build on previous sections on basic statistics and creating summary tables (e.g. using dplyr and gtsummary) and show how to create publication-ready tables. The primary package we use is flextable, which is compatible with multiple R Markdown formats, including HTML and Word documents.

Example:

Table of Ebola patients with outcome information: Number, proportion, and CT values of cases who recovered and died

Preparation

Using packages discussed in other sections such as gtsummary and dplyr, create a table with the content of interest, with the correct columns and rows.

Here we create a simple summary table of patient outcomes using the Ebola linelist. We are interested in knowing the number and proportion of patients that recover or died, as well as their median CT values, by hospital of admission.

table <- linelist %>% 
  group_by(hospital, outcome) %>% 
  filter(!is.na(outcome) & hospital!="Missing") %>%  # Remove cases with missing outcome/hospital
  summarise(ct_value = median(ct_blood), N = n()) %>%  # Calculate indicators of interest 
  pivot_wider(values_from=c(ct_value, N), names_from = outcome) %>% #Pivot from long to wide
  mutate(`N known` = `N_Death` + N_Recover) %>% # Calculate total number
  arrange(-`N known`) %>% # Arrange rows from highest to lowest total
  mutate(`Prop_Death` = `N_Death`/`N known`*100,  # Calculate proportions
         `Prop_Recover` = `N_Recover`/`N known`*100) %>% 
  select(hospital, `N known`, `N_Recover`, `Prop_Recover`, ct_value_Recover,
         `N_Death`, `Prop_Death`, ct_value_Death) # Re-order columns 


table
## # A tibble: 5 x 8
## # Groups:   hospital [5]
##   hospital                             `N known` N_Recover Prop_Recover ct_value_Recover N_Death Prop_Death ct_value_Death
##   <chr>                                    <int>     <int>        <dbl>            <dbl>   <int>      <dbl>          <dbl>
## 1 Port Hospital                             1364       579         42.4               22     785       57.6             22
## 2 Military Hospital                          708       309         43.6               21     399       56.4             22
## 3 Other                                      685       290         42.3               21     395       57.7             22
## 4 Central Hospital                           358       165         46.1               22     193       53.9             22
## 5 St. Mark's Maternity Hospital (SMMH)       325       126         38.8               21     199       61.2             22

Load, and install if necessary, flextable, which we will use to convert the above table into a fully formatted and presentable table.

pacman::p_load(flextable)

Basic flextable

Creating a flextable

To create and manage flextable objects, we pass the table object to the flextable() function and progressively add more formatting and features using the %>% pipe syntax.

The syntax of each line of flextable code is as follows:

  • function(table, i = X, j = X, part = "X"), where:
    • table is the name of the table object, although it does not need to be stated if using pipe syntax and the table name has already been specified (see examples).
    • The ‘function’ can be one of many different functions, such as width to determine column widths, bg to set background colours, align to set whether text is centre/right/left aligned, and so on.
    • part refers to which part of the table the function is being applied to. E.g. “header”, “body” or “all”.
    • i specifies the row to apply the function to, where ‘X’ is the row number. For multiple rows, e.g. the first to third rows, one can specify: i = c(1:3). Note that if ‘body’ is selected, the first row starts from underneath the header section.
    • j specifies the column to apply the function to, where ‘X’ is the column number or name. For multiple columns, e.g. the fifth and sixth, one can specify: j = c(5,6).
ftable <- flextable(table) 
ftable

We see immediately that it has suboptimal spacing, and the proportions have too many decimal places.

Formatting cell content

We edit the proportion columns to one decimal place using flextable code. Note this could also have been done at the data management stage with the round() function.

ftable <- colformat_num(ftable, j = c(4,7), digits = 1)
ftable

Formatting column width

We can use the autofit() function, which nicely stretches out the table so that each cell only has one row of text.

ftable %>% autofit()

However, this might not always be appropriate, especially if there are very long values within cells, meaning the table might not fit on the page.

Instead, we can specify widths. It can take some playing around to know what width value to put. In the example below, we specify different widths for column 1, column 2, and columns 4 to 8.

ftable <- ftable %>% 
  width(j=1, width = 2.7) %>% 
  width(j=2, width = 1.5) %>% 
  width(j=c(4,5,7,8), width = 1)

ftable

Column headers

We want clearer headers for easier interpretation of the table contents.

First we add an extra header layer for clarity, using add_header_row() with top = TRUE, so that columns about the same subgroup can be grouped together. We then rename the now-second header layer. Finally, we merge the grouped columns in the top header row.

ftable <- ftable %>% 
  add_header_row( values = c("Hospital", 
                             "Total cases with known outcome", 
                             "Recovered", 
                             "",
                             "",
                             "Died", # This and the next two columns will show one merged value, 'Died'
                             "",     # Merging keeps the first value, so the next two are blank
                             ""), 
                  top = T) %>% # New header goes on top of existing header row
    set_header_labels(hospital = "", # Rename the columns in original header row
                    `N known` = "",                  
                    N_Recover = "Total",
                    Prop_Recover = "% of cases",
                    ct_value_Recover = "Median CT values",
                    N_Death = "Total",
                    Prop_Death = "% of cases",
                     ct_value_Death = "Median CT values")  %>% 
  merge_at(i = 1, j = 3:5, part = "header") %>% # Horizontally merge columns in new header row
  merge_at(i = 1, j = 6:8, part = "header")  

ftable

Formatting borders and background

Flextable has default borders that do not respond well to additional header levels. We start from scratch by removing the existing borders with border_remove(). Then we add a black line to the bottom of the table using hline(), by specifying the 5th row of the table body - flextable by default adds the line to the bottom of the specified row. In order to add black lines to the top of sections, we need to use hline_top().

We also use fp_border() here, which defines the border that is applied. This is a function from the officer package.

library(officer)

ftable <- ftable %>% 
  border_remove() %>% # Remove existing borders 
  hline(part = "body", i=5, border = fp_border(color="black", width=2)) %>% 
  hline_top(part = "header", border = fp_border(color="black", width=2)) %>%
  hline_top(part = "body", border = fp_border(color="black", width=2)) 


ftable

Font and alignment

We centre-align all columns aside from the left-most column with the hospital names, using the align function.

ftable <- ftable %>% 
   flextable::align(align = "center", j = c(2:8), part = "all") 
ftable

Additionally, we can increase the header font size and change the text to bold.

ftable <-  ftable %>%  
  fontsize(i = 1, size = 12, part = "header") %>% 
    bold(i = 1, bold = TRUE, part = "header")

ftable

Background

To distinguish the content of the table from the headers, we may want to add additional formatting. e.g. changing the background colour. In this example we change the table body to gray.

ftable <- ftable %>% 
    bg(part = "body", bg = "gray95")  

ftable 

Conditional flextable formatting

We can highlight all values in a column that meet a certain rule, e.g. where more than 55% of cases died.

ftable %>% 
  bg(j=7, i= ~ Prop_Death >=55, part = "body", bg = "red") 

Or, we can highlight an entire row meeting a certain criterion, such as a hospital of interest. This is particularly helpful when looping through, e.g., reports per geographical area, to highlight in tables how the current area compares with the others. To do this we simply remove the column (j) specification.

ftable %>% 
  bg(., j=c(1:8), i= ~ hospital == "Military Hospital", part = "body", bg = "#91c293") 

Saving your table

There are different ways the table can be integrated into your output.

Save single table

You can export the tables to Word, PowerPoint or HTML, or as an image (PNG) file. To do this, one of the following functions is used:

  • save_as_docx
  • save_as_pptx
  • save_as_image
  • save_as_html

For instance:

save_as_docx("my table" = ftable, path = "file.docx")
# Edit the 'my table' as needed for the title of table. If not specified the whole file will be blank. 

save_as_image(ftable, path = "file.png")

Note that the webshot or webshot2 package is required to save a flextable as an image. Images may come out with transparent backgrounds.

If you want to view a ‘live’ version of the flextable output in the intended document format - for instance to check that it fits on the page, or to copy it into another document - you can use the print method with the preview argument set to “pptx” or “docx”. The document will pop up.

print(ftable, preview = "docx") # Word document example
print(ftable, preview = "pptx") # Powerpoint example

Save table to R markdown document

This table can be integrated into an automated document (an R Markdown output) by calling the table object within an R Markdown chunk. This means the table can be updated as part of a report whose underlying data may change, so the numbers are refreshed.

See detail in the R markdown section of this handbook.

Resources

The full flextable explanation is here: https://ardata-fr.github.io/flextable-book/

Age pyramids and Likert-scales

Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics. These tabs demonstrate how to produce age pyramids using:

  • Fast & easy: Using the apyramid package
  • More flexible: Using ggplot()
  • Having baseline demographics displayed in the background of the pyramid
  • Using pyramid-style plots to show other types of data (e.g. responses to Likert-style survey questions)

Overview

Age/gender demographic pyramids in R are generally made with ggplot() by creating two barplots (one for each gender), converting one group’s values to negative values, and flipping the x and y axes so the bars run horizontally with age on the vertical axis.

Here we cover:

  • A quick and easy approach using the apyramid package
  • More customizable code using the raw ggplot() commands
  • How to combine case demographic data and compare it with that of a baseline population (as shown above)
  • Application of these methods to show other types of data (e.g. responses to Likert-style survey questions)

Preparation

For this tab we use the linelist dataset that is cleaned in the Cleaning tab.

To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:

  • The gender column must be cleaned.
  • Age should be in an age category column, which should be of class factor (with correctly ordered levels)

Load packages

First, load the packages required for this analysis:

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
               apyramid,  # a package dedicated to creating age pyramids
               stringr)   # working with strings for titles, captions, etc.

Load the data

linelist <- rio::import("linelist_cleaned.csv")

Check class of variables

Ensure that the age variable is of class numeric, and check the class and order of levels of age_cat and age_cat5.

class(linelist$age_years)
## [1] "numeric"
class(linelist$age_cat)
## [1] "factor"
class(linelist$age_cat5)
## [1] "factor"
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1066  1103   918   773  1102   724   102    13    87
table(linelist$age_cat5, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+  <NA> 
##  1066  1103   918   773   646   456   305   211   126    82    49    35    13     5     5     5     1     2    87

apyramid package

The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.

Linelist data

Using the cleaned linelist dataset, we can create an age pyramid with just one simple command. If you need help cleaning your data, see the handbook page on Cleaning data and core functions (LINK). In this command:

  • The data argument is set as the linelist dataframe
  • The age_group argument is set to the name (in quotes) of the categorical age column (in this case age_cat5)
  • The split_by argument (bar colors) should be a binary column (in this case “gender”)
apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender")

The same result can be shown as percents of all cases, instead of counts, by setting proportional = TRUE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE)

When using the apyramid package, if the split_by column is binary (e.g. male/female, or yes/no), then the result will appear as a pyramid. However, if there are more than two values in the split_by column (not including NA), the pyramid will appear as a faceted barplot, with empty bars in the background indicating the range of the un-faceted data set for each age group. Values of split_by will appear as labels at the top of each facet. For example, below the split_by variable is “hospital”.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "hospital",
                      na.rm = FALSE)        # also show a bar for patients with missing age

Missing values

Rows with missing values in the split_by or age_group columns, if coded as NA, will not trigger the faceting shown above. By default these rows will not be shown. However, you can specify that they appear - in an adjacent barplot and as a separate age group at the top - by specifying na.rm = FALSE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      na.rm = FALSE)         # show patients missing age or gender

Proportions, colors, & aesthetics

By default, the bars display counts (not %), a dashed mid-line is shown for each group, and the colors are green/purple. Each of these parameters can be adjusted, as shown below:

You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE,                  # show percents, not counts
                      show_midpoint = FALSE,                # remove bar mid-point line
                      #pal = c("orange", "purple")          # can specify alt. colors here (but not labels, see below)
                      )+                 
  
  # additional ggplot commands
  theme_minimal()+                                          # simplify the background
  scale_fill_manual(values = c("orange", "purple"),         # to specify colors AND labels
                     labels = c("Male", "Female"))+
  labs(y = "Percent of all cases",                          # note that x and y labels are switched (see ggplot tab)
       x = "Age categories",                          
       fill = "Gender", 
       caption = "My data source and caption here",
       title = "Title of my plot",
       subtitle = "Subtitle with \n a second line...")+
  theme(
    legend.position = "bottom",                             # move legend to bottom
    axis.text = element_text(size = 10, face = "bold"),     # fonts/sizes, see ggplot tips page
    axis.title = element_text(size = 12, face = "bold"))

Aggregated data

The examples above assume your data are in a linelist-like format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.

This code aggregates the linelist data into counts by age category and gender, in a “wide” format. Learn more about Grouping data and Pivoting data in their respective pages:

demo_agg <- linelist %>% 
  count(age_cat5, gender, name = "cases") %>% 
  pivot_wider(id_cols = age_cat5, names_from = gender, values_from = cases) %>% 
  rename(`missing_gender` = `NA`)

…which makes the dataset look like this: one column for the age category, and columns for male counts, female counts, and missing counts.

To set up these data for the age pyramid, we will pivot the data to be “long” with the pivot_longer() function from tidyr. This is because ggplot() generally prefers “long” data, and apyramid uses ggplot().

# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>% 
  pivot_longer(c(f, m, missing_gender),            # cols to elongate
               names_to = "gender",                # name for new col of categories
               values_to = "counts") %>%           # name for new col of counts
  mutate(gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA

Then use the split_by and count arguments of age_pyramid() to specify the respective columns:

apyramid::age_pyramid(data = demo_agg_long,
                      age_group = "age_cat5",
                      split_by = "gender",
                      count = "counts")      # give the column name for the aggregated counts

Note in the above that the factor order of “m” and “f” is different (pyramid reversed). To adjust the order you must re-define gender in the aggregated data as a factor and order the levels as desired.
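As a sketch (assuming the gender values are “m” and “f”), this re-leveling could be done with fct_relevel() from the forcats package before plotting:

```r
library(dplyr)
library(forcats)

# re-define gender as a factor with "m" as the first level,
# so the pyramid sides appear in the desired order
demo_agg_long <- demo_agg_long %>% 
  mutate(gender = fct_relevel(gender, "m", "f"))
```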

ggplot()

Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.

apyramid uses ggplot() in the background (and accepts ggplot() commands added), but this page shows how to adjust or recreate a pyramid only using ggplot(), if you wish.

Constructing the plot

First, understand that to make such a pyramid using ggplot() the approach is to:

  • Within ggplot(), create two graphs by age category - one for each of the two grouping values (in this case gender). See the filters applied to the data arguments in each geom_histogram() command below.

  • If using geom_histogram(), the graphs operate off the numeric column (e.g. age_years), whereas if using geom_col() (or geom_bar() with stat = "identity") the graphs operate from an ordered factor (e.g. age_cat5).

  • One graph will have positive count values, while the other will have its counts converted to negative values - this allows both graphs to be seen and compared against each other in the same plot.

  • The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.

  • Lastly, the counts-axis labels must be specified so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).

A simple version of this, using geom_histogram(), is below:

  # begin ggplot
  ggplot(data = linelist, aes(x = age, fill = gender)) +
  
  # female histogram
  geom_histogram(data = filter(linelist, gender == "f"),
                 breaks = seq(0,85,5),
                 colour = "white") +
  
  # male histogram (values converted to negative)
  geom_histogram(data = filter(linelist, gender == "m"),
                 breaks = seq(0,85,5),
                 aes(y=..count..*(-1)),
                 colour = "white") +
  
  # flip the X and Y axes
  coord_flip() +
  
  # adjust counts-axis scale
  scale_y_continuous(limits = c(-600, 900),
                     breaks = seq(-600,900,100),
                     labels = abs(seq(-600, 900, 100)))

DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.
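One sketch of this auto-adjustment, assuming the linelist and its age_cat5 and gender columns: compute the largest bar count first, then use it for both axis limits.

```r
library(dplyr)

# largest count in any age-gender group - used to set symmetric axis limits
max_count <- linelist %>% 
  count(age_cat5, gender) %>% 
  pull(n) %>% 
  max(na.rm = TRUE)

# then, inside the ggplot() pipeline:
# scale_y_continuous(limits = c(-max_count, max_count),
#                    labels = abs)   # display all labels as positive
```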

There are many things you can change/add to this simple version, including:

  • Auto-adjust the counts-axis scale to your data (avoid the errors discussed in the warning above)
  • Manually specify colors and legend labels

Convert counts to percents:

# create dataset with proportion of total
pyramid_data <- linelist %>%
  count(age_cat5, gender, name = "counts") %>% 
  ungroup() %>%                                   # ungroup so percent calculations are not by group
  mutate(percent = round(100*(counts / sum(counts, na.rm=T)),1), 
         percent = case_when(
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))

Importantly, here we save the maximum and minimum percents (the minimum being the largest value in the negative direction) so we know how far the scale should extend. These will be used in the ggplot() command below.

max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

max_per
## [1] 11.1
min_per
## [1] -7

Finally we make the ggplot() on the percent data. We specify scale_y_continuous() to extend the pre-defined lengths in each direction (positive and “negative”). We use floor() and ceiling() to round decimals in the appropriate direction (down or up) for each side of the axis.

# begin ggplot
  ggplot()+  # default x-axis is age in years;

  # case data graph
  geom_bar(data = pyramid_data,
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        # fill of bars by gender
           colour = "white")+         # white around each bar
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  

  # adjust the axes scales (remember they are flipped now!)
  #scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
  scale_y_continuous(limits = c(min_per, max_per),
                     breaks = seq(floor(min_per), ceiling(max_per), 2),
                     labels = paste0(abs(seq(floor(min_per), ceiling(max_per), 2)), "%"))+

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",
               "m" = "darkgreen"),
    labels = c("Female", "Male"),
  ) +
  
  # label values (remember X and Y flipped now)
  labs(
    x = "Age group",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust=0, size=11, face = "italic")) + 
  
  ggtitle("Age and gender of cases")

Compare to baseline

With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.

Import and view the population data

# import the population demographics data
pop <- rio::import("country_demographics.csv")

First some data management steps:

Here we record the order of age categories that we want to appear. Due to some quirks in the way ggplot() is implemented, it is easiest to store these as a character vector and use them later in the plotting function.

# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
                "25-29","30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74",
                "75-79", "80-84", "85+")

Combine the population and case data through the dplyr function bind_rows():

  • First, ensure they have the exact same column names, age categories values, and gender values
  • Make them have the same data structure: columns of age category, gender, counts, and percent of total
  • Bind them together, one on-top of the other (bind_rows())
# create/transform population data, with percent of total
########################################################
pop_data <- pivot_longer(pop, c(m, f), names_to = "gender", values_to = "counts") %>% # pivot gender columns longer
  mutate(data = "population",                                                         # add column designating data source
         percent  = round(100*(counts / sum(counts, na.rm=T)),1),                     # calculate % of total
         percent  = case_when(                                                        # if male, convert % to negative
                            gender == "f" ~ percent,
                            gender == "m" ~ -percent,
                            TRUE          ~ NA_real_))

Review the changed population dataset

Now implement the same for the case linelist. Slightly different because it begins with case-rows, not counts.

# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
  group_by(age_cat5, gender) %>%  # aggregate linelist cases into age-gender groups
  summarize(counts = n()) %>%     # calculate counts per age-gender group
  ungroup() %>% 
  mutate(data = "cases",                                          # add column designating data source
         percent = round(100*(counts / sum(counts, na.rm=T)),1),  # calculate % of total for age-gender groups
         percent = case_when(                                     # convert % to negative if male
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))

Review the changed case dataset

Now the two datasets are combined, one on top of the other (same column names)

# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows(case_data, pop_data)

Store the maximum and minimum percent values, used in the plotting function to define the extent of the plot (so no bars are cut off!)

# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

Now the plot is made with ggplot():

  • One bar graph of population data (wider, more transparent bars)
  • One bar graph of case data (small, more solid bars)
# begin ggplot
##############
ggplot()+  # default x-axis is age in years;

  # population data graph
  geom_bar(data = filter(pyramid_data, data == "population"),
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        
           colour = "black",                               # black color around bars
           alpha = 0.2,                                    # more transparent
           width = 1)+                                     # full width
  
  # case data graph
  geom_bar(data = filter(pyramid_data, data == "cases"), 
           stat = "identity",                              # use % as given in data, not counting rows
           aes(x = age_cat5,                               # age categories as original X axis
               y = percent,                                # % as original Y-axis
               fill = gender),                             # fill of bars by gender
           colour = "black",                               # black color around bars
           alpha = 1,                                      # not transparent 
           width = 0.3)+                                   # half width
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  
  # adjust axes order, scale, and labels (remember X and Y axes are flipped now)
  # manually ensure that age-axis is ordered correctly
  scale_x_discrete(limits = age_levels)+ 
  
  # set percent-axis 
  scale_y_continuous(limits = c(min_per, max_per),                                          # min and max defined above
                     breaks = seq(floor(min_per), ceiling(max_per), by = 2),                # from min% to max% by 2 
                     labels = paste0(                                                       # for the labels, paste together... 
                       abs(seq(floor(min_per), ceiling(max_per), by = 2)),                  # ...rounded absolute values of breaks... 
                       "%"))+                                                               # ... with "%"
                                                                                            # floor(), ceiling() round down and up 

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",         # assign colors to values in the data
               "m" = "darkgreen"),
    labels = c("f" = "Female",
               "m"= "Male"),      # change labels that appear in legend, note order
  ) +

  # plot labels, titles, caption    
  labs(
    title = "Case age and gender distribution,\nas compared to baseline population",
    subtitle = "",
    x = "Age category",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    legend.position = "bottom",                             # move legend to bottom
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0), 
    plot.caption = element_text(hjust=0, size=11, face = "italic"))

Likert scale

The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.

Import the data

# import the likert survey response data
likert_data <- rio::import("likert_data.csv")

Start with data that looks like this, with a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very Poor”, “Poor”, “Good”, “Very Good”).

First, some data management steps:

  • Pivot the data longer
  • Create new column direction depending on whether response was generally “positive” or “negative”
  • Set the Factor level order for the status column and the Response column
  • Store the max count value so limits of plot are appropriate
melted <- pivot_longer(likert_data, Q1:Q8, names_to = "Question", values_to = "Response") %>% 
     mutate(direction = case_when(
               Response %in% c("Poor","Very Poor") ~ "Negative",
               Response %in% c("Good", "Very Good") ~ "Positive",
               TRUE ~ "Unknown"),
            status = factor(status, levels = rev(c(
                 "Senior", "Intermediate", "Junior"))),
            Response = factor(Response, levels = c("Very Good", "Good",
                                             "Very Poor", "Poor"))) # must reverse Very Poor and Poor for ordering to work

melted_max <- melted %>% 
   group_by(status, Question) %>% 
   summarize(n = n())

Just like in the ggplot() age pyramid, we save the max value to dynamically calibrate our axis scale.

melted_max <- max(melted_max$n, na.rm=T)
melted_max
## [1] 18

Now make the plot:

# make plot
ggplot()+
     # bar graph of the "negative" responses 
     geom_bar(data = filter(melted,
                            direction == "Negative"), 
              aes(x = status,
                        y=..count..*(-1),    # counts inverted to negative
                        fill = Response),
                    color = "black",
                    position = "stack")+
     
     # bar graph of the "positive" responses
     geom_bar(data = filter(melted, direction == "Positive"),
              aes(x = status, fill = Response),
              colour = "black",
              position = "stack")+
     
     # flip the X and Y axes
     coord_flip()+
  
     # Black vertical line at 0
     geom_hline(yintercept = 0, color = "black", size=1)+
     
    # convert labels to all positive numbers
    scale_y_continuous(limits = c(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10),   # seq from neg to pos by 10, edges rounded outward to nearest 10
                       breaks = seq(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10, 10),
                       labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
                                            seq(0, ceiling(melted_max/10)*10, 10))))) +
     
    # color scales manually assigned 
    scale_fill_manual(values = c("Very Good"  = "green4", # assigns colors
                                  "Good"      = "green3",
                                  "Poor"      = "yellow",
                                  "Very Poor" = "red3"),
                       breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
     
    
     
    # facet the entire plot so each question is a sub-plot
    facet_wrap(~Question, ncol = 3)+
     
    # labels, titles, caption
    labs(x = "Respondent status",
          y = "Number of responses",
          fill = "")+
     ggtitle(str_glue("Likert-style responses\nn = {nrow(likert_data)}"))+

     # aesthetic settings
     theme_minimal()+
     theme(axis.text = element_text(size = 12),
           axis.title = element_text(size = 14, face = "bold"),
           strip.text = element_text(size = 14, face = "bold"),  # facet sub-titles
           plot.title = element_text(size = 20, face = "bold"),
           panel.background = element_rect(fill = NA, color = "black")) # black box around each facet

Resources

Diagrams

Overview

This page covers:

  • Flow diagrams using DiagrammeR
  • Alluvial/Sankey diagrams
  • Event timelines
  • Dendrogram organizational trees (e.g. of folder contents)
  • DAGs (Directed Acyclic Graphs)

Preparation

Load packages

pacman::p_load(
  DiagrammeR,     # for flow diagrams
  networkD3       # For alluvial/Sankey diagrams
  )

Flow diagrams

One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.

Tools

The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.

Basic structure

  1. Open the instructions grViz("
  2. Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {
  3. Graph statement (layout, rank direction)
  4. Nodes statements (create nodes)
  5. Edges statements (gives links between nodes)
  6. Close the instructions }")

Simple examples

Below are two simple examples.

A very minimal example:

# A minimal plot
DiagrammeR::grViz("digraph {
  
graph[layout = dot, rankdir = LR]

a
b
c

a -> b -> c
}")

An example with applied public health context:

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,
         overlap = true,
         fontsize = 10]
  
  # nodes
  #######
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]               # width of circles
  
  Primary                         # names of nodes
  Secondary
  Tertiary

  # edges
  #######
  Primary   -> Secondary [label = 'case transfer']
  Secondary -> Tertiary [label = 'case transfer']
}
")

Syntax

Basic syntax

Node names, or edge statements, can be separated with spaces, semicolons, or newlines.

Rank direction

A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.

Node names

Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (' '). It may be easier to have a short node name and assign a label, as shown below within brackets [ ]. A label is also necessary to include a newline within the node name - use \n in the node label within single quotes, as shown below.

Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
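A minimal sketch of this shorthand - the single edge statement below draws an arrow from each of a and b to c:

```r
DiagrammeR::grViz("digraph {
  {a b} -> c    // one statement draws both a -> c and b -> c
}")
```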

Layouts

  • dot (set rankdir to either TB, LR, RL, or BT)
  • neato
  • twopi
  • circo

Nodes - editable attributes

  • label (text, in single quotes if multi-word)
  • fillcolor (many possible colors)
  • fontcolor
  • alpha (transparency 0-1)
  • shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
  • style
  • sides
  • peripheries
  • fixedsize (h x w)
  • height
  • width
  • distortion
  • penwidth (width of shape border)
  • x (displacement left/right)
  • y (displacement up/down)
  • fontname
  • fontsize
  • icon

Edges - editable attributes

  • arrowsize
  • arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
  • arrowtail
  • dir (direction: forward, back, both, none)
  • style (dashed, …)
  • color
  • alpha
  • headport (text in front of arrowhead)
  • tailport (text behind arrowtail)
  • fontname
  • fontsize
  • fontcolor
  • penwidth (width of arrow)
  • minlen (minimum length)

Color names: hexadecimal values or ‘X11’ color names, see here for X11 details

Complex examples

The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling

DiagrammeR::grViz("               # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            # layout top-to-bottom
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      
  
  Primary   [label = 'Primary\nFacility'] 
  Secondary [label = 'Secondary\nFacility'] 
  Tertiary  [label = 'Tertiary\nFacility'] 
  SC        [label = 'Surveillance\nCoordination',
             fontcolor = darkgreen] 
  
  # edges
  #######
  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  
  # grouped edge
  {Primary Secondary Tertiary} -> SC [label = 'case reporting',
                                      fontcolor = darkgreen,
                                      color = darkgreen,
                                      style = dashed]
}
")

Sub-graph clusters

To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have each subgraph identified within a bounding box, begin the name of the subgraph with “cluster”, as shown with the 4 boxes below.

DiagrammeR::grViz("             # All instructions are within a large character string
digraph surveillance_diagram {  # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            
         overlap = true,
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,                  # shape = circle
       fixedsize = true
       width = 1.3]                      # width of circles
  
  subgraph cluster_passive {
    Primary   [label = 'Primary\nFacility'] 
    Secondary [label = 'Secondary\nFacility'] 
    Tertiary  [label = 'Tertiary\nFacility'] 
    SC        [label = 'Surveillance\nCoordination',
               fontcolor = darkgreen] 
  }
  
  # nodes (boxes)
  ###############
  node [shape = box,                     # node shape
        fontname = Helvetica]            # text font in node
  
  subgraph cluster_active {
    Active [label = 'Active\nSurveillance']; 
    HCF_active [label = 'HCF\nActive Search']
  }
  
  subgraph cluster_EBD {
    EBS [label = 'Event-Based\nSurveillance (EBS)']; 
    'Social Media'
    Radio
  }
  
  subgraph cluster_CBS {
    CBS [label = 'Community-Based\nSurveillance (CBS)'];
    RECOs
  }

  
  # edges
  #######
  {Primary Secondary Tertiary} -> SC [label = 'case reporting']

  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red]
  
  HCF_active -> Active
  
  {'Social Media'; Radio} -> EBS
  
  RECOs -> CBS
}
")

Node shapes

The example below, borrowed from this tutorial, shows applied node shapes and a shorthand for serial edge connections

DiagrammeR::grViz("digraph {

graph [layout = dot, rankdir = LR]

# define the global styles of the nodes; we can override these for individual nodes if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]

data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label =  'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']

# edge definitions with the node IDs
{data1 data2}  -> process -> statistical -> results
}")

Outputs

How to handle and save outputs

  • Outputs will appear in RStudio’s Viewer pane, by default in the lower-right alongside Files, Plots, Packages, and Help.
  • To export you can “Save as image” or “Copy to clipboard” from the Viewer. The graphic will adjust to the specified size.
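You can also save the diagram from code. One approach (an assumption - it requires the DiagrammeRsvg and rsvg packages to be installed) is to convert the htmlwidget to SVG and then write it as a PNG:

```r
pacman::p_load(DiagrammeR, DiagrammeRsvg, rsvg)

# build a diagram object
graph <- DiagrammeR::grViz("digraph {a -> b}")

# convert to SVG, then write a PNG file
graph %>% 
  DiagrammeRsvg::export_svg() %>% 
  charToRaw() %>% 
  rsvg::rsvg_png("my_diagram.png")
```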

Parameterized figures

Parameterized figures

“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is a unique numeric index. Here is a basic example:”
https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/

# Define some sample data
data <- list(a=1000, b=800, c=600, d=400)


DiagrammeR::grViz("
digraph graph2 {

graph [layout = dot]

# node definitions with substituted label text
node [shape = rectangle, width = 4, fillcolor = Beige]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']

a -> b -> c -> d

}

[1]:  paste0('Raw Data (n = ', data$a, ')')
[2]: paste0('Remove Errors (n = ', data$b, ')')
[3]: paste0('Identify Potential Customers (n = ', data$c, ')')
[4]: paste0('Select Top Priorities (n = ', data$d, ')')
")

Much of the above is adapted from the tutorial at this site

Other more in-depth tutorial: http://rich-iannone.github.io/DiagrammeR/

Alluvial/Sankey Diagrams

Preparation

Load packages

pacman::p_load(networkD3)

Plotting from dataset

Plotting the connections in a dataset

https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html

Counts of age category and hospital, relabeled as target and source, respectively.

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>% 
  rename(source = hospital,
         target = age_cat)

Now formalize the nodes list, and adjust the ID columns to be numbers instead of labels:

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

Now plot the Sankey diagram:

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",
                   fontSize = 12,
                   nodeWidth = 30)
p

Here is an example where the patient outcome is included as well. Note in the data management step how we bind rows of counts of hospital -> outcome, using the same column names.

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  mutate(age_cat = stringr::str_glue("Age {age_cat}")) %>% 
  count(hospital, age_cat) %>% 
  rename(source = age_cat,
         target = hospital) %>% 
  bind_rows(
    linelist %>% 
      select(hospital, outcome) %>% 
      count(hospital, outcome) %>% 
      rename(source = hospital,
             target = outcome)
  )

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",
                   fontSize = 12,
                   nodeWidth = 30)
p

https://www.displayr.com/sankey-diagrams-r/

Timeline Sankey - LTFU from cohort… application/rejections… etc.

Event timelines

To make a timeline showing specific events, you can use the vistime package.

See this vignette

# load package
pacman::p_load(vistime,  # make the timeline
               plotly    # for interactive visualization
               )

Here is the events dataset we begin with:

p <- vistime(data)    # apply vistime

library(plotly)

# step 1: transform into a list
pp <- plotly_build(p)

# step 2: Marker size
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "markers") pp$x$data[[i]]$marker$size <- 10
}

# step 3: text size
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textfont$size <- 10
}


# step 4: text position
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textposition <- "right"
}

#print
pp

DAGs

You can build a DAG manually using the DiagrammeR package and DOT language, as described in another tab. Alternatively, there are packages like ggdag and dagitty

https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-dags.html

https://www.r-bloggers.com/2019/08/causal-inference-with-dags-in-r/#:~:text=In%20a%20DAG%20all%20the,for%20drawing%20and%20analyzing%20DAGs.
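As a minimal sketch with ggdag (the variable names here are hypothetical): dagify() defines the causal relationships and ggdag() plots them.

```r
pacman::p_load(ggdag)

# outcome y caused by exposure x and confounder z; z also causes x
dag <- dagify(y ~ x + z,
              x ~ z)

ggdag(dag) +
  theme_dag()   # remove axes and gridlines
```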

Resources

Links to other online tutorials or resources.

Combinations analysis

Overview

This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency of symptom combinations.

This analysis is often called:

  • Multiple response analysis
  • Sets analysis
  • Combinations analysis

The first method shown uses the package ggupset, and the second uses the package UpSetR.

An example plot is below. Five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.

Preparation

Load packages

pacman::p_load(
  tidyverse,
  UpSetR,
  ggupset)

View the data

This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot.

View the data (scroll to the right to see the symptoms variables). The first 50 rows are shown.

Re-format values

To align with the format expected by ggupset, we convert the “yes” and “no” values to the actual symptom name, using case_when() from dplyr. If “no”, we set the value to missing (NA).

# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into the symptom name itself
  mutate(fever = case_when(fever == "yes" ~ "fever",          # if old value is "yes", new value is "fever"
                           TRUE           ~ NA_character_),   # if old value is anything other than "yes", the new value is NA
         
         chills = case_when(chills == "yes" ~ "chills",
                           TRUE           ~ NA_character_),
         
         cough = case_when(cough == "yes" ~ "cough",
                           TRUE           ~ NA_character_),
         
         aches = case_when(aches == "yes" ~ "aches",
                           TRUE           ~ NA_character_),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ "shortness_of_breath",
                           TRUE           ~ NA_character_))

Now we make two final variables:
1. Paste together all the symptoms of the patient (character variable)
2. Convert the above to class list, so it can be accepted by ggupset to make the plot

linelist_sym_1 <- linelist_sym_1 %>% 
  mutate(
         # combine the variables into one, using paste() with a semicolon separating any values
         all_symptoms = paste(fever, chills, cough, aches, shortness_of_breath, sep = "; "),
         
         # make a copy of all_symptoms variable, but of class "list" (which is required to use ggupset() in next step)
         all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
         )

View the new data. Note the two columns at the end - the pasted combined values, and the list.

ggupset

Load the package

pacman::p_load(ggupset)

Create the plot. We begin with a ggplot() and geom_bar(), but then we add the special scale_x_upset() from the package.

ggplot(
  data = linelist_sym_1,
  aes(x = all_symptoms_list)) +
geom_bar() +
scale_x_upset(
  reverse = FALSE,
  n_intersections = 10,
  sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"))+
  labs(title = "Signs & symptoms",
       subtitle = "10 most frequent combinations of signs and symptoms",
       caption = "Caption here.",
       x = "Symptom combination",
       y = "Frequency in dataset")

More information on ggupset can be found online, or offline in the package documentation via your RStudio Help tab (run ?ggupset).

UpSetR

The UpSetR package allows more customization of the plot, but it can be more difficult to execute:

Load package

pacman::p_load(UpSetR)

Data cleaning

We must convert the linelist symptoms values to 1/0.

# Make 1/0 dataset for UpSetR

linelist_sym_2 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into 1 and 0
  mutate(fever = case_when(fever == "yes" ~ 1,          # if old value is "yes", new value is 1
                           TRUE           ~ 0),   # if old value is anything other than "yes", the new value is 0
         
         chills = case_when(chills == "yes" ~ 1,
                           TRUE           ~ 0),
         
         cough = case_when(cough == "yes" ~ 1,
                           TRUE           ~ 0),
         
         aches = case_when(aches == "yes" ~ 1,
                           TRUE           ~ 0),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ 1,
                           TRUE           ~ 0))

Now make the plot using the custom function upset() - using only the symptoms columns. You must designate which “sets” to compare (the names of the symptom columns). Alternatively, use nsets = and order.by = "freq" to only show the top X combinations.

# Make the plot
UpSetR::upset(
  select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
  sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"),
  order.by = "freq",
  sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
  empty.intersections = "on",
  # nsets = 3,
  number.angles = 0,
  point.size = 3.5,
  line.size = 2, 
  mainbar.y.label = "Symptoms Combinations",
  sets.x.label = "Patients with Symptom")

Resources

  • https://github.com/hms-dbmi/UpSetR - package Github page
  • https://gehlenborglab.shinyapps.io/upsetr/ - Shiny App version - you can upload your own data
  • https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf - package documentation - difficult to interpret

Heatmaps

Heatmaps can be useful visualizations. Below we demonstrate two examples:

  • Creating a visual matrix of transmission events by age (“who infected whom”)
  • Tracking reporting metrics across many facilities/jurisdictions over time

Preparation

Load packages

pacman::p_load(
  tidyverse,       # data manipulation and visualization
  rio,             # importing data 
  lubridate        # working with dates
  )

Datasets

This page utilizes the case linelist for the transmission matrix section, and a separate dataset of daily malaria case counts by facility for the metrics tracking section. They are loaded and cleaned in their individual sections.

Transmission matrix

Heat tiles can be useful to visualize matrices. One example is to display “who-infected-whom” in an outbreak. This assumes that you have information on transmission events in your linelist.

We begin from the case linelist:

  • There is one row per case
  • There is a column that contains the case_id of the infector, who is also a case in the linelist

The first 50 rows of the linelist are shown below for demonstration:

We load the case linelist

linelist <- import("linelist_cleaned.xlsx")

Objective: We need to achieve a "long"-style dataframe that contains the frequency of transmission events between each age category. This will take several data manipulation steps to achieve.

To begin, we create a dataframe of the cases and their ages, called case_ages. The first rows are displayed below.

case_ages <- linelist %>% 
  select(case_id, infector, age_cat) %>% 
  rename("case_age_cat" = "age_cat")

Next, we create a dataframe of the infectors - at the moment it consists of a single column. These are the infector IDs from the linelist. Not every case has a known infector, so we remove missing values. The first rows are displayed below.

infectors <- linelist %>% 
  select(infector) %>% 
  filter(!is.na(infector))

Next, we use joins to procure the ages of the infectors. This is not simple, because in the linelist, infector’s ages are not listed as such. We achieve this result by joining the case linelist to the infectors - joining such that the infector in the left-side “baseline” dataframe links to the case_id in the right-side linelist. Thus, the data from the infector’s case record in the linelist (including age) is added to the infector row. The first rows are displayed below.

infector_ages <- infectors %>%             # begin with infectors
  left_join(                               # add the linelist data to each infector  
    linelist,
    by = c("infector" = "case_id")) %>%    # match infector to their information as a case
  select(infector, age_cat) %>%            # keep only columns of interest
  rename("infector_age_cat" = "age_cat")   # rename for clarity

Then, we combine the cases and their ages with the infectors and their ages. Each of these dataframes has the column infector, so it is used for the join. The first rows are displayed below:

ages_complete <- case_ages %>%  
  left_join(
    infector_ages,
    by = "infector") %>%        # each has the column infector
  drop_na()                     # drop rows with any missing data

Below is a simple cross-tabulation of counts between the case and infector age groups. Labels are added for clarity.

table(cases = ages_complete$case_age_cat,
      infectors = ages_complete$infector_age_cat)
##        infectors
## cases   0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+
##   0-4   127 104    89    98   103    98    24   0
##   5-9   163 125    83    84   125    82    17   0
##   10-14 127 102   121    65   110    85    10   1
##   15-19 106  78    93    47    95    42     9   3
##   20-29 130 114    89   111   148   113    14   1
##   30-49  83 107    84    42    81    59    21   4
##   50-69  12   5    17    15     9     8     0   3
##   70+     2   0     0     0     3     0     0   0

We can convert this table to a dataframe with data.frame() from base R, which also automatically converts it to “long” format, which is desired for the ggplot(). The first rows are shown below.

long_counts <- data.frame(table(
    cases     = ages_complete$case_age_cat,
    infectors = ages_complete$infector_age_cat))

We do the same, but apply prop.table() from base R to the table, so that instead of counts we get the proportion of all values. The first rows are shown below.

long_prop <- data.frame(prop.table(table(
    cases = ages_complete$case_age_cat,
    infectors = ages_complete$infector_age_cat)))

Now finally we can create the heatmap with ggplot2 package, using the geom_tile() function.

  • In the aesthetics aes() of geom_tile() set the x and y as the case age and infector age
  • Also in aes() set the argument fill = to the Freq column - this is the value that will be converted to a tile color
  • Set a scale color with scale_fill_gradient() - you can specify the high/low colors
    • Note that scale_color_gradient() is different! In this case you want the fill
  • Because the color is made via “fill”, you can use the fill = argument in labs() to change the legend title
ggplot(data = long_prop)+       # use long data, with proportions as Freq
  geom_tile(                    # visualize it in tiles
    aes(
      x = cases,         # x-axis is case age
      y = infectors,     # y-axis is infector age
      fill = Freq))+            # color of the tile is the Freq column in the data
  scale_fill_gradient(          # adjust the fill color of the tiles
    low = "blue",
    high = "orange")+
  labs(                         # labels
    x = "Case age",
    y = "Infector age",
    title = "Who infected whom",
    subtitle = "Frequency matrix of transmission events",
    fill = "Proportion of all\ntransmission events"     # legend title
  )

Reporting metrics over time

Often in public health, one objective is to assess trends over time for many entities (facilities, jurisdictions, etc.). One way to visualize such trends over time is a heatmap where the x-axis is time and the y-axis shows the many entities.

Preparation

We begin with a dataset of daily malaria reports from many facilities. The reports contain a date, province, district, and malaria counts.

Below are the first 30 rows of these data:

And we also load this separate dataset of daily malaria case counts by facility:

facility_count_data <- import("facility_count_data.rds")

Aggregate and summarize

The objective in this example is to transform the daily facility total malaria case counts (seen in previous tab) into weekly summary statistics of facility reporting performance - in this case the proportion of days per week that the facility reported any data. For this example we will show data only for Spring District from April-May 2019.

To achieve this we will do the following data management steps:

  1. Filter the data as appropriate (by place, date)
  2. Create a week column using floor_date() from package lubridate
    • This function returns the start-date of a given date’s week, using a specified start date of each week (e.g. “Mondays”)
  3. The data are grouped by columns “location” and “week” to create analysis units of “facility-week”
  4. The function summarise() creates new columns reflecting summary statistics per facility-week group:
    • Number of days per week (7 - a static value)
    • Number of reports received from the facility-week (could be more than 7!)
    • Sum of malaria cases reported by the facility-week (just for interest)
    • Number of unique days in the facility-week for which there is data reported
    • Percent of the 7 days per facility-week for which data was reported
  5. The dataframe is joined (right_join()) to a comprehensive list of all possible facility-week combinations, to make the dataset complete. The matrix of all possible combinations is created by applying expand() to those two columns of the dataframe as it is at that moment in the pipe chain (represented by “.”). Because a right_join() is used, all rows in the expand() dataframe are kept, and added to agg_weeks if necessary. These new rows appear with NA (missing) summarized values.

Below we demonstrate step-by-step:

# Create weekly summary dataset
agg_weeks <- facility_count_data %>% 
  
  # filter the data as appropriate
  filter(
    District == "Spring",
    data_date < as.Date("2019-06-01")) 

Now the dataset has 584 rows, where it previously had 3038.

Next we create a week column reflecting the start date of the week for each record. This is achieved with the lubridate package and the function floor_date(), which is set to “week” and for the weeks to begin on Mondays (day 1 of the week - Sundays would be 7). The top rows are shown below.

agg_weeks <- agg_weeks %>% 
  # Create week column from data_date
  mutate(
    week = lubridate::floor_date(                     # create new column of weeks
      data_date,                                      # date column
      unit = "week",                                  # give start of the week
      week_start = 1))                                # weeks to start on Mondays 

The new week column can be seen on the far right of the dataframe

Now we group the data into facility-weeks and summarise them to produce statistics per facility-week. See the page on Grouping data for tips. The grouping itself doesn’t change the dataframe, but it impacts how the subsequent summary statistics are calculated.

The top rows are shown below. Note how the columns have completely changed to reflect the desired summary statistics. Each row reflects one facility-week.

agg_weeks <- agg_weeks %>%   

  # Group into facility-weeks
  group_by(
    location_name, week,
    .drop = F) %>%
  
  # Create summary statistics columns on the grouped data
  summarize(
    n_days          = 7,                                          # 7 days per week           
    n_reports       = dplyr::n(),                                 # number of reports received per week (could be >7)
    malaria_tot     = sum(malaria_tot, na.rm = T),                # total malaria cases reported
    n_days_reported = length(unique(data_date)),                  # number of unique days reporting per week
    p_days_reported = round(100*(n_days_reported / n_days)))      # percent of days reporting

Finally, we run the command below to ensure that ALL possible facility-weeks are present in the data, even if they were missing before.

We are using a right_join() on itself (the dataset is represented by “.”) but having been expanded to include all possible combinations of the columns week and location_name. See documentation on the expand() function in the page on [Pivoting]. Before running this code the dataset contains 97 rows.

# Create dataframe of every possible facility-week
expanded_weeks <- agg_weeks %>% 
  mutate(week = as.factor(week)) %>%         # convert date to a factor so expand() works correctly
  tidyr::expand(., week, location_name) %>%  # expand dataframe to include all possible facility-week combinations
                                             # note: "." represents the dataset at that moment in the pipe chain
  mutate(week = as.Date(week))               # re-convert week to class Date so the subsequent right_join works
                                             

# Use right-join with the expanded facility-week list to fill-in the missing gaps in the data
agg_weeks <- agg_weeks %>%      
  right_join(expanded_weeks) %>%                            # Ensure every possible facility-week combination appears in the data
  mutate(p_days_reported = replace_na(p_days_reported, 0))  # convert missing values to 0                           

After running this code, the dataset contains 165 rows.

Create heatmap

The ggplot() is made using geom_tile() from the ggplot2 package:

  • The week column on the x-axis is class Date, allowing use of scale_x_date()
  • location_name on the y-axis will show all facility names
  • The fill is the performance for that facility-week (numeric)
  • scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA
  • scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format
  • Aesthetic themes and labels can be adjusted as necessary

Basic

A basic heatmap is produced below, using the default colors, scales, etc. Within the aes() for geom_tile() you must provide an x-axis column, a y-axis column, and a column for fill = - these are the numeric values that are converted to tile color.

ggplot(data = agg_weeks)+
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported))

Cleaned plot

We can make this plot look better by adding additional ggplot2 functions, as shown below. See the page on ggplot tips for details.

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n in newline)
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

Ordered y-axis

Currently, the facilities are ordered "alphabetically" from the bottom to the top. If you want to adjust the order of the y-axis facilities, convert them to class factor and provide the order. See the page on Factors for tips.

Below, the column location_name is converted to a factor, and the order of its levels is set based on the total number of reporting days filed by the facility across the whole time-span.

To do this, we create a dataframe representing the total number of reports per facility, arranged in ascending order. We can use its location_name column to order the factor levels in the plot.

facility_order <- agg_weeks %>% 
  group_by(location_name) %>% 
  summarize(tot_reports = sum(n_days_reported, na.rm=T)) %>% 
  arrange(tot_reports) # ascending order

See the dataframe below:

Now use the above vector (facility_order$location_name) to be the order of the factor levels of location_name in the dataframe agg_weeks:

# load package 
pacman::p_load(forcats)

# create factor and define levels manually
agg_weeks <- agg_weeks %>% 
  mutate(location_name = as_factor(location_name),
         location_name = fct_relevel(location_name, 
                                     facility_order$location_name))

And now the data are re-plotted, with location_name being an ordered factor:

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n in newline)
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

Display values

You can add a geom_text() layer on top of the tiles, to display the actual value of each tile. Be aware this may not look pretty if you have many small tiles!

The following code has been added: geom_text(aes(label = p_days_reported)). This adds text onto every tile. The text displayed is the value assigned to the argument label =, which in this case has been set to the same numeric column p_days_reported that is used to create the color gradient.

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  # text
  geom_text(
    aes(
      x = week,
      y = location_name,
      label = p_days_reported))+          # add text on top of tile
  
  # fill scale
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n in newline)
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

Resources

Transmission chains

Overview

The primary tool to handle, analyse and visualise transmission chains and contact tracing data is the package epicontacts, developed by the folks at RECON. Try out the interactive plot below by hovering over the nodes for more information, dragging them to move them and clicking on them to highlight downstream cases.

Preparation

Packages and data

First load the standard packages required for data import and manipulation.

pacman::p_load(
   rio,          # File import
   here,         # File locator
   tidyverse,    # Data management + ggplot2 graphics
   remotes       # Package installation from github
)

You will require the development version of epicontacts, which can be installed from GitHub using the remotes package. You only need to run the code below once, not every time you use the package.

remotes::install_github("reconhub/epicontacts@timeline")

Next, import the standard, cleaned linelist for this analysis.

# import the cleaned linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Creating an epicontacts object

We then need to create an epicontacts object, which requires two types of data:

  • a linelist documenting cases where columns are variables and rows correspond to unique cases
  • a list of edges defining links between cases on the basis of their unique IDs (these can be contacts, transmission events, etc.)

As we already have a linelist, we just need to create a list of edges between cases, more specifically between their IDs. We can extract transmission links from the linelist by linking the infector column with the case_id column. At this point we can also add “edge properties”, by which we mean any variable describing the link between the two cases, not the cases themselves. For illustration, we will add a location variable describing the location of the transmission event, and a duration variable describing the duration of the contact in days.

In the code below, the dplyr function transmute is similar to mutate, except it only keeps the columns we have specified within the function. The drop_na function will filter out any rows where the specified columns have an NA value; in this case, we only want to keep the rows where the infector is known.

## generate contacts
contacts <- linelist %>%
  transmute(
    infector = infector,
    case_id = case_id,
    location = sample(c("Community", "Nosocomial"), n(), TRUE),
    duration = sample.int(10, n(), TRUE)
  ) %>%
  drop_na(infector)

We can now create the epicontacts object using the make_epicontacts function. We need to specify which column in the linelist points to the unique case identifier, as well as which columns in the contacts point to the unique identifiers of the cases involved in each link. These links are directional in that infection is going from the infector to the case, so we need to specify the from and to arguments accordingly. We therefore also set the directed argument to TRUE, which will affect future operations.

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts,
  id = "case_id",
  from = "infector",
  to = "case_id",
  directed = TRUE
)

Upon examining the epicontacts object, we can see that the case_id column in the linelist has been renamed to id, and the case_id and infector columns in the contacts have been renamed to from and to. This ensures consistency in subsequent handling, visualisation and analysis operations.

## view epicontacts object
epic
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 5,888 cases in linelist; 3,800 contacts; directed 
## 
##   // linelist
## 
## # A tibble: 5,888 x 30
##    id    generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5
##    <chr>      <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>   
##  1 a3c8~          4 2014-05-07     2014-05-08 2014-05-10       2014-05-14   Recover m          1 years            1 0-4     0-4     
##  2 d8a1~          4 2014-05-06     2014-05-08 2014-05-10       NA           <NA>    f          4 years            4 0-4     0-4     
##  3 5fe5~          4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m         21 years           21 20-29   20-24   
##  4 8689~          4 NA             2014-05-13 2014-05-14       2014-05-18   Recover f          2 years            2 0-4     0-4     
##  5 11f8~          2 NA             2014-05-16 2014-05-18       2014-05-30   Recover m         27 years           27 20-29   25-29   
##  6 893f~          3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m         25 years           25 20-29   25-29   
##  7 be99~          3 2014-05-03     2014-05-22 2014-05-23       2014-05-24   Recover f         18 years           18 15-19   15-19   
##  8 d052~          7 2014-05-20     2014-05-24 2014-05-26       2014-06-05   <NA>    f          2 years            2 0-4     0-4     
##  9 ce9c~          5 2014-05-27     2014-05-27 2014-05-29       2014-06-17   Death   m         20 years           20 20-29   20-24   
## 10 275c~          5 2014-05-24     2014-05-27 2014-05-28       2014-06-07   Death   f          4 years            4 0-4     0-4     
## # ... with 5,878 more rows, and 17 more variables: hospital <chr>, lon <dbl>, lat <dbl>, infector <chr>, source <chr>, wt_kg <dbl>,
## #   ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>,
## #   bmi <dbl>, days_onset_hosp <dbl>
## 
##   // contacts
## 
## # A tibble: 3,800 x 4
##    from   to     location   duration
##    <chr>  <chr>  <chr>         <int>
##  1 2ae019 a3c8b8 Community         8
##  2 20b688 d8a13d Community         9
##  3 f547d6 5fe599 Nosocomial        2
##  4 11f8ea 893f25 Nosocomial        7
##  5 aec8ec be99c8 Nosocomial        1
##  6 4b38b7 d0523a Community         5
##  7 53da57 ce9c02 Community        10
##  8 e02f66 275cc7 Nosocomial       10
##  9 893f25 07e3e8 Nosocomial        7
## 10 5387a2 2b8773 Community         9
## # ... with 3,790 more rows

Handling

Subsetting

The subset() method for epicontacts objects allows for, among other things, filtering of networks based on properties of the linelist ("node attributes") and the contacts database ("edge attributes"). These values must be passed as named lists to the respective argument. For example, in the code below we are keeping only the male cases in the linelist that have an infection date between April and July 2014 (dates are specified as ranges), and transmission links that occurred in the hospital.

sub_attributes <- subset(
  epic,
  node_attribute = list(
    gender = "m",
    date_infection = as.Date(c("2014-04-01", "2014-07-01"))
  ), 
  edge_attribute = list(location = "Nosocomial")
)
sub_attributes
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 70 cases in linelist; 1,900 contacts; directed 
## 
##   // linelist
## 
## # A tibble: 70 x 30
##    id    generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5
##    <chr>      <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>   
##  1 a3c8~          4 2014-05-07     2014-05-08 2014-05-10       2014-05-14   Recover m          1 years            1 0-4     0-4     
##  2 5fe5~          4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m         21 years           21 20-29   20-24   
##  3 893f~          3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m         25 years           25 20-29   25-29   
##  4 ce9c~          5 2014-05-27     2014-05-27 2014-05-29       2014-06-17   Death   m         20 years           20 20-29   20-24   
##  5 be50~          5 2014-06-09     2014-06-15 2014-06-16       2014-06-19   Death   m         33 years           33 30-49   30-34   
##  6 4cff~          6 2014-06-05     2014-06-15 2014-06-16       2014-06-25   <NA>    m         34 years           34 30-49   30-34   
##  7 c36e~          8 2014-06-15     2014-06-20 2014-06-21       2014-06-24   Death   m         51 years           51 50-69   50-54   
##  8 02d8~          9 2014-06-14     2014-06-20 2014-06-20       2014-07-01   Death   m         51 years           51 50-69   50-54   
##  9 b799~          5 2014-06-27     2014-07-03 2014-07-05       2014-07-12   Recover m         34 years           34 30-49   30-34   
## 10 da8e~          5 2014-06-20     2014-07-18 2014-07-20       2014-08-01   <NA>    m         39 years           39 30-49   35-39   
## # ... with 60 more rows, and 17 more variables: hospital <chr>, lon <dbl>, lat <dbl>, infector <chr>, source <chr>, wt_kg <dbl>,
## #   ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>,
## #   bmi <dbl>, days_onset_hosp <dbl>
## 
##   // contacts
## 
## # A tibble: 1,900 x 4
##    from   to     location   duration
##    <chr>  <chr>  <chr>         <int>
##  1 f547d6 5fe599 Nosocomial        2
##  2 11f8ea 893f25 Nosocomial        7
##  3 aec8ec be99c8 Nosocomial        1
##  4 e02f66 275cc7 Nosocomial       10
##  5 893f25 07e3e8 Nosocomial        7
##  6 cbbe78 057e7a Nosocomial        4
##  7 ba7326 4cff96 Nosocomial        5
##  8 3ff1bc a6c614 Nosocomial        7
##  9 057e7a c36eb4 Nosocomial        2
## 10 e61cb9 02d8fd Nosocomial        4
## # ... with 1,890 more rows

We can use the thin() function to either filter the linelist to include only cases that are found in the contacts by setting the argument what = "linelist", or filter the contacts to include only cases that are found in the linelist by setting the argument what = "contacts". In the code below, we further filter the epicontacts object to keep only the transmission links involving the male cases infected between April and July that we filtered for above. We can see that only one known transmission link fits that specification.

sub_attributes <- thin(sub_attributes, what = "contacts")
nrow(sub_attributes$contacts)
## [1] 1

In addition to subsetting by node and edge attributes, networks can be pruned to only include components that are connected to certain nodes. The cluster_id argument takes a vector of case IDs and returns the linelist of individuals that are linked, directly or indirectly, to those IDs. In the code below, we can see that a total of 13 linelist cases are involved in the clusters containing 2ae019 and 71577a.

sub_id <- subset(epic, cluster_id = c("2ae019","71577a"))
nrow(sub_id$linelist)
## [1] 13

The subset() method for epicontacts objects also allows filtering by cluster size using the cs, cs_min and cs_max arguments. In the code below, we are keeping only the cases linked to clusters of 10 cases or larger, and can see that 271 linelist cases are involved in such clusters.

sub_cs <- subset(epic, cs_min = 10)
nrow(sub_cs$linelist)
## [1] 271

Accessing IDs

The get_id() function retrieves information on case IDs in the dataset, and can be parameterized as follows:

  • linelist: IDs in the line list data
  • contacts: IDs in the contacts dataset (“from” and “to” combined)
  • from: IDs in the “from” column of the contacts dataset
  • to: IDs in the “to” column of the contacts dataset
  • all: IDs that appear anywhere in either dataset
  • common: IDs that appear in both the contacts dataset and the line list
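Combining these options can answer further questions. For example, a sketch (assuming the `epic` object created earlier in this page) of how to list IDs that appear in the contacts but have no corresponding row in the linelist:

```r
## IDs present in the contacts dataset but absent from the linelist;
## these are cases in transmission chains for whom we have no linelist data
missing_from_linelist <- setdiff(
  get_id(epic, "contacts"),
  get_id(epic, "linelist")
)

head(missing_from_linelist)
length(missing_from_linelist)
```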

For example, what are the first ten IDs in the contacts dataset?

contacts_ids <- get_id(epic, "contacts")
head(contacts_ids, n = 10)
##  [1] "2ae019" "20b688" "f547d6" "11f8ea" "aec8ec" "4b38b7" "53da57" "e02f66" "893f25" "5387a2"

How many IDs are found in both the linelist and the contacts?

length(get_id(epic, "common"))
## [1] 4352

Visualization

Basic plotting

All visualisations of epicontacts objects are handled by the plot function. We will first filter the epicontacts object to include only the cases with onset dates in June 2014 using the subset function, and only include the contacts linked to those cases using the thin function.

## subset epicontacts object
sub <- epic %>%
  subset(
    node_attribute = list(date_onset = as.Date(c("2014-06-01", "2014-06-30")))
  ) %>%
  thin("contacts")

We can then create the basic, interactive plot very simply as follows:

## plot epicontacts object
plot(
  sub,
  width = 700,
  height = 700
)

You can move the nodes around by dragging them, hover over them for more information and click on them to highlight connected cases.

There are a large number of arguments to further modify this plot. We will cover the main ones here, but check out the documentation via ?vis_epicontacts (the function called when using plot on an epicontacts object) to get a full description of the function arguments.

Visualising node attributes

Node color, node shape and node size can be mapped to a given column in the linelist using the node_color, node_shape and node_size arguments. This is similar to the aes syntax you may recognise from ggplot2.

The specific colors, shapes and sizes of nodes can be specified as follows:

  • Colors via the col_pal argument, either by providing a named list for manual specification of each color as done below, or by providing a color palette function such as colorRampPalette(c("black", "red", "orange")), which would provide a gradient of colours between the ones specified.

  • Shapes by passing a named list to the shapes argument, specifying one shape for each unique element in the linelist column specified by the node_shape argument. See epicontacts::codeawesome for available shape codes.

  • Size by passing a size range of the nodes to the size_range argument.

Here is an example, where color represents the outcome, shape the gender, and size the age:

plot(
  sub, 
  node_color = "outcome",
  node_shape = "gender",
  node_size = 'age',
  col_pal = c(Death = "firebrick", Recover = "green"),
  shapes = c(f = "female", m = "male"),
  size_range = c(40, 60),
  height = 700,
  width = 700
)

Visualising edge attributes

Edge color, width and linetype can be mapped to a given column in the contacts dataframe using the edge_color, edge_width and edge_linetype arguments. The specific colors and widths of the edges can be specified as follows:

  • Colors via the edge_col_pal argument, in the same manner used for col_pal.

  • Widths by passing a range of edge widths to the width_range argument.

Here is an example:

plot(
  sub, 
  node_color = "outcome",
  node_shape = "gender",
  node_size = 'age',
  col_pal = c(Death = "firebrick", Recover = "green"),
  shapes = c(f = "female", m = "male"),
  size_range = c(40, 60),
  edge_color = 'location',
  edge_linetype = 'location',
  edge_width = 'duration',
  edge_col_pal = c(Community = "orange", Nosocomial = "purple"),
  width_range = c(1, 3),
  height = 700,
  width = 700
)

Temporal axis

We can also visualise the network along a temporal axis by mapping the x_axis argument to a column in the linelist. In the example below, the x-axis represents the date of symptom onset. We have also specified the arrow_size argument to ensure the arrows are not too large, and set label = FALSE to make the figure less cluttered.

plot(
  sub,
  x_axis = "date_onset",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

There are a large number of additional arguments to further specify how this network is visualised along a temporal axis, which you can check out via ?vis_temporal_interactive (the function called when using plot on an epicontacts object with x_axis specified). We’ll go through some below.

Specifying transmission tree shape

There are two main shapes that the transmission tree can assume, specified using the network_shape argument. The first is a branching shape as shown above, where a straight edge connects any two nodes. This is the most intuitive representation; however, it can result in overlapping edges in a densely connected network. The second shape is rectangle, which produces a tree resembling a phylogeny. For example:

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

Each case node can be assigned a unique vertical position by toggling the position_dodge argument. The position of unconnected cases (i.e. with no reported contacts) is specified using the unlinked_pos argument.

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  position_dodge = TRUE,
  unlinked_pos = "bottom",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

The position of the parent node relative to the children nodes can be specified using the parent_pos argument. The default option is to place the parent node in the middle, however it can be placed at the bottom (parent_pos = 'bottom') or at the top (parent_pos = 'top').

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  parent_pos = "top",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

Saving plots and figures

You can save a plot as an interactive, self-contained html file with the visSave function from the visNetwork package:

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  parent_pos = "top",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
) %>%
  visNetwork::visSave("network.html")

Saving these network outputs as an image is unfortunately less straightforward; it requires you to save the file as an html and then take a screenshot of that file using the webshot package. In the code below, we convert the html file saved above into a PNG:

webshot::webshot(url = "network.html", file = "network.png")
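Note that webshot depends on the external PhantomJS program, which must be installed once per machine; the package includes a helper to download it:

```r
## one-time setup: download PhantomJS, which webshot uses to render html files
webshot::install_phantomjs()
```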

Timelines

You can also add case timelines to the network, which are represented along the x-axis for each case. This can be used to visualise case locations, for example, or the time to outcome. To generate a timeline, we need to create a data.frame of at least three columns indicating the case ID, the start date of the “event” and the end date of the “event”. You can also add any number of other columns, which can then be mapped to node and edge properties of the timeline. In the code below, we generate a timeline going from the date of symptom onset to the date of outcome, and keep the outcome and hospital variables, which we use to define the node shape and colour. Note that you can have more than one timeline row/event per case, for example if a case is transferred between multiple hospitals.

## generate timeline
timeline <- linelist %>%
  transmute(
    id = case_id,
    start = date_onset,
    end = date_outcome,
    outcome = outcome,
    hospital = hospital
  )

We then pass the timeline element to the timeline argument. We can map timeline attributes to timeline node colours, shapes and sizes in the same way defined in previous sections, except that we have two nodes: the start and end node of each timeline, which have separate arguments. For example, tl_start_node_color defines which timeline column is mapped to the colour of the start node, while tl_end_node_shape defines which timeline column is mapped to the shape of the end node. We can also map colour, width, linetype and labels to the timeline edge via the tl_edge_* arguments.

See ?vis_temporal_interactive (the function called when plotting an epicontacts object) for detailed documentation on the arguments. Each argument is annotated in the code below too:

## define shapes
shapes <- c(
  f = "female",
  m = "male",
  Death = "user-times",
  Recover = "heartbeat",
  "NA" = "question-circle"
)

## define colours
colours <- c(
  Death = "firebrick",
  Recover = "green",
  "NA" = "grey"
)

## make plot
plot(
  sub,
  ## map the x-axis to the date of onset
  x_axis = "date_onset",
  ## use rectangular network shape
  network_shape = "rectangle",
  ## map case node shapes to the gender column
  node_shape = "gender",
  ## we don't want to map node colour to any columns - this is important as the
  ## default value is to map to node id, which will mess up the colour scheme
  node_color = NULL,
  ## set case node size to 30 (as this is not a character, node_size is not
  ## mapped to a column but instead interpreted as the actual node size)
  node_size = 30,
  ## set transmission link width to 4 (as this is not a character, edge_width is
  ## not mapped to a column but instead interpreted as the actual edge width)
  edge_width = 4,
  ## provide the timeline object
  timeline = timeline,
  ## map the shape of the end node to the outcome column in the timeline object
  tl_end_node_shape = "outcome",
  ## set the size of the end node to 15 (as this is not a character, this
  ## argument is not mapped to a column but instead interpreted as the actual
  ## node size)
  tl_end_node_size = 15,
  ## map the colour of the timeline edge to the hospital column
  tl_edge_color = "hospital",
  ## set the width of the timeline edge to 2 (as this is not a character, this
  ## argument is not mapped to a column but instead interpreted as the actual
  ## edge width)
  tl_edge_width = 2,
  ## map edge labels to the hospital variable
  tl_edge_label = "hospital",
  ## specify the shape for every node attribute (defined above)
  shapes = shapes,
  ## specify the colour palette (defined above)
  col_pal = colours,
  ## set the size of the arrow to 0.5
  arrow_size = 0.5,
  ## use two columns in the legend
  legend_ncol = 2,
  ## set font size
  font_size = 15,
  ## define formatting for dates
  date_labels = c("%d %b %Y"),
  ## don't plot the ID labels below nodes
  label = FALSE,
  ## specify height
  height = 1000,
  ## specify width
  width = 1200,
  ## ensure each case node has a unique y-coordinate - this is very important
  ## when using timelines, otherwise you will have overlapping timelines from
  ## different cases
  position_dodge = TRUE
)
## Warning in assert_timeline(timeline, x, x_axis): 5863 timeline row(s) removed as ID not found in linelist or start/end date is NA

Analysis

Summarising

We can get an overview of some of the network properties using the summary function.

## summarise epicontacts object
summary(epic)
## 
## /// Overview //
##   // number of unique IDs in linelist: 5888
##   // number of unique IDs in contacts: 5511
##   // number of unique IDs in both: 4352
##   // number of contacts: 3800
##   // contacts with both cases in linelist: 56.868 %
## 
## /// Degrees of the network //
##   // in-degree summary:
##            Min.         1st Qu.          Median            Mean         3rd Qu.            Max. 
## 0.0000000000000 0.0000000000000 1.0000000000000 0.5392365545622 1.0000000000000 1.0000000000000 
## 
##   // out-degree summary:
##            Min.         1st Qu.          Median            Mean         3rd Qu.            Max. 
## 0.0000000000000 0.0000000000000 0.0000000000000 0.5392365545622 1.0000000000000 6.0000000000000 
## 
##   // in and out degree summary:
##           Min.        1st Qu.         Median           Mean        3rd Qu.           Max. 
## 0.000000000000 1.000000000000 1.000000000000 1.078473109124 1.000000000000 7.000000000000 
## 
## /// Attributes //
##   // attributes in linelist:
##  generation date_infection date_onset date_hospitalisation date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat infector source wt_kg ht_cm ct_blood fever chills cough aches vomit temp time_admission bmi days_onset_hosp
## 
##   // attributes in contacts:
##  location duration

For example, we can see that only 57% of contacts have both cases in the linelist; this means that we do not have linelist data on a significant number of cases involved in these transmission chains.
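As a rough illustration of how that percentage arises (this is an assumption about the calculation, not the package’s documented internals; it uses the `epic` object from earlier), a contact is fully documented when both its “from” and “to” IDs appear in the linelist:

```r
## flag contacts for which both the "from" and "to" IDs have a row in the linelist
both_in_linelist <- epic$contacts$from %in% epic$linelist$id &
  epic$contacts$to %in% epic$linelist$id

## proportion of fully documented contacts, as a percentage (~57 %)
mean(both_in_linelist) * 100
```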

Pairwise characteristics

The get_pairwise() function allows processing of one or more variables in the line list according to each pair in the contacts dataset. In the following example, the date of disease onset is extracted from the line list in order to compute the difference between the dates of onset for each pair. The value produced by this comparison represents the serial interval (si).

si <- get_pairwise(epic, "date_onset")   
summary(si)
##           Min.        1st Qu.         Median           Mean        3rd Qu.           Max.           NA's 
##  0.00000000000  5.00000000000  9.00000000000 10.92364645997 15.00000000000 99.00000000000           1639
tibble(si = si) %>%
  ggplot(aes(si)) +
  geom_histogram() +
  labs(
    x = "Serial interval",
    y = "Frequency"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1639 rows containing non-finite values (stat_bin).

get_pairwise() will interpret the class of the column being used for comparison and adjust its method of comparing the values accordingly. For numbers and dates (like the si example above), the function will subtract the values. When applied to columns that are characters or categorical, get_pairwise() will paste the values together. Because the function also allows for arbitrary processing (see the f argument), these discrete combinations can be easily tabulated and analyzed.

head(get_pairwise(epic, "gender"), n = 10)
##  [1] "m -> m" "m -> f" "f -> m" "m -> m" NA       "f -> f" NA       "f -> f" "m -> f" "m -> f"
get_pairwise(epic, "gender", f = table)
##            values.to
## values.from   f   m
##           f 467 506
##           m 518 455
fisher.test(get_pairwise(epic, "gender", f = table))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  get_pairwise(epic, "gender", f = table)
## p-value = 0.02336245432724
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6757687121847981 0.9725031772435068
## sample estimates:
##         odds ratio 
## 0.8107713218632261

Here, we see a significant association between transmission links and gender.

Identifying clusters

The get_clusters() function can be used to identify connected components in an epicontacts object. First, we use it to retrieve a data.frame containing the cluster information:

clust <- get_clusters(epic, output = "data.frame")
table(clust$cluster_size)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 1536 1680 1182  784  545  342  308  208  171  100   99   24   26   42
ggplot(clust, aes(cluster_size)) +
  geom_bar() +
  labs(
    x = "Cluster size",
    y = "Frequency"
  )

Let us look at the largest clusters. For this, we add cluster information to the epicontacts object and then subset it to keep only the largest clusters:

epic <- get_clusters(epic)
max_size <- max(epic$linelist$cluster_size)
plot(subset(epic, cs = max_size))

Calculating degrees

The degree of a node corresponds to its number of edges or connections to other nodes. get_degree() provides an easy method for calculating this value for epicontacts networks. A high degree in this context indicates an individual who was in contact with many others. The type argument indicates that we want to count both the in-degree and out-degree, the only_linelist argument indicates that we only want to calculate the degree for cases in the linelist.

deg_both <- get_degree(epic, type = "both", only_linelist = TRUE)

Which individuals have the ten most contacts?

head(sort(deg_both, decreasing = TRUE), 10)
## 916d0a 858426 6833d7 f093ea 11f8ea 02d8fd 50fb75 c8c4d5 a127a7 71577a 
##      7      6      6      6      5      5      5      5      5      5

What is the mean number of contacts?

mean(deg_both)
## [1] 1.07847310912445

Resources

The epicontacts page provides an overview of the package functions and includes some more in-depth vignettes.

The github page can be used to raise issues and request features.

Phylogenetic trees

Phylogenetic trees are used to visualize and describe the relatedness and evolution of organisms based on the sequence of their genetic code. They can be constructed from genetic sequences using distance-based methods (such as the neighbor-joining method) or character-based methods (such as maximum likelihood and Bayesian Markov chain Monte Carlo methods). Next-generation sequencing (NGS) has become more affordable and is increasingly used in public health to describe pathogens causing infectious diseases. Portable sequencing devices decrease the turnaround time and make data available to support outbreak investigation in real time. NGS data can be used to identify the origin or source of an outbreak strain and its propagation, as well as to determine the presence of antimicrobial resistance genes. To visualize the genetic relatedness between samples, a phylogenetic tree is constructed. In this page we will learn how to use the ggtree package, which allows the combination of phylogenetic trees with additional sample data in the form of a dataframe, in order to help observe patterns and improve understanding of the outbreak dynamics.

Preparation

This code chunk shows the loading of required packages:

# load/install packages
pacman::p_load(here, ggplot2, dplyr, ape, ggtree, treeio, ggnewscale)

There are several different formats in which a phylogenetic tree can be stored (e.g. Newick, NEXUS, Phylip). A common one, which we will use in this example, is the Newick file format (.nwk), the standard for representing trees in computer-readable form. This means an entire tree can be expressed in a string such as “((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);”, listing all nodes and tips and their relationship (branch length) to each other.
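For illustration, a Newick string like the one above can be parsed directly from text with ape::read.tree() and its text argument (a minimal, self-contained sketch):

```r
library(ape)

## parse a Newick string directly, without writing a .nwk file
small_tree <- read.tree(
  text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"
)

small_tree$tip.label  # the five tip names, in the order they appear in the string
plot(small_tree)      # quick base-graphics view of the small tree
```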

It is important to understand that the phylogenetic tree file in itself does not contain sequencing data, but is merely the result of the distances between the sequences. We therefore cannot extract sequencing data from a tree file.

We use the ape package to import a phylogenetic tree file and store it in a list object of class “phylo”. When we inspect our tree object, we see it contains 299 tips (or samples) and 236 internal nodes.

# read in the tree: we use the here package to specify the location of our R project and data files:
tree <- ape::read.tree(here::here("data", "Shigella_tree.nwk"))

# inspect the tree file:
tree
## 
## Phylogenetic tree with 299 tips and 236 internal nodes.
## 
## Tip labels:
##   SRR5006072, SRR4192106, S18BD07865, S18BD00489, S17BD08906, S17BD05939, ...
## Node labels:
##   17, 29, 100, 67, 100, 100, ...
## 
## Rooted; includes branch lengths.

Second we import a table with additional information for each sequenced sample such as gender, country of origin and attributes for antimicrobial resistance:

# We read in a csv file into a dataframe format:
sample_data <- read.csv("sample_data_Shigella_tree.csv", sep = ",", na.strings = c("NA"), head = TRUE, stringsAsFactors=F)

Below are the first 30 rows of these data:

We clean and inspect our data: In order to assign the correct sample data to the phylogenetic tree, the Sample_IDs in the sample_data file need to match the tip.labels in the tree file:

# We clean the data: we select certain columns to be protected from cleaning in order to maintain their formatting (e.g. the sample names, as they have to match the names in the phylogenetic tree file)
#sample_data <- linelist::clean_data(sample_data, protect = c(1, 3:5)) 

# We check the formatting of the tip labels in the tree file: 

head(tree$tip.label) # these are the sample names in the tree - we inspect the first 6 with head()
## [1] "SRR5006072" "SRR4192106" "S18BD07865" "S18BD00489" "S17BD08906" "S17BD05939"
# We make sure the first column in our dataframe contains the Sample_IDs:
colnames(sample_data)   
##  [1] "Sample_ID"                  "serotype"                   "Country"                    "Continent"                 
##  [5] "Travel_history"             "Year"                       "Belgium"                    "Source"                    
##  [9] "Gender"                     "gyrA_mutations"             "macrolide_resistance_genes" "MIC_AZM"                   
## [13] "MIC_CIP"
# We look at the sample_IDs in the dataframe to make sure the formatting is the same as in the tip.labels (eg. letters are all capital, no extra _ between letters and numbers etc.)
head(sample_data$Sample_ID) # we inspect only the first 6 using head()
## [1] "S17BD05944" "S15BD07413" "S18BD07247" "S19BD07384" "S18BD07338" "S18BD02657"

Upon inspection we can see that the format of sample_ID in the dataframe corresponds to the format of sample names at the tree tips. These do not have to be sorted in the same order to be matched.
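This match can also be checked programmatically; a small sketch using the tree and sample_data objects loaded above:

```r
## tip labels present in the tree but absent from the sample data;
## any IDs returned here point to a formatting mismatch
setdiff(tree$tip.label, sample_data$Sample_ID)

## TRUE if every tip label has a matching Sample_ID
all(tree$tip.label %in% sample_data$Sample_ID)
```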

We are ready to go!

Simple tree visualization

Different tree layouts

ggtree() offers many different layout formats and some may be more suitable for your specific purpose than others:

# Examples:
ggtree(tree) # most simple linear tree
ggtree(tree,  branch.length = "none") # most simple linear tree with all tips aligned
ggtree(tree, layout="circular") # most simple circular tree
ggtree(tree, layout="circular", branch.length = "none") # most simple circular tree with all tips aligned

# for other options see online: http://yulab-smu.top/treedata-book/chapter4.html

Simple tree with addition of sample data

The easiest annotation of your tree is the addition of sample names at the tips, as well as coloring of the tip points and, if desired, the branches:

# A: Plot Circular tree:
ggtree(tree, layout="circular", branch.length='none') %<+% sample_data + # the %<+% is used to add your dataframe with sample data to the tree
  aes(color=I(Belgium))+     # color the branches according to a variable in your dataframe
  scale_color_manual(
    name = "Sample Origin",  # name of your color scheme (will show up in the legend like this)
    breaks = c("Yes", "No"), # the different options in your variable
    labels = c("NRCSS Belgium", "Other"), # how you want the different options named in your legend, allows for formatting
    values= c("blue", "grey"),            # the colors assigned to the options above ("Yes" = blue, "No" = grey)
    na.value="grey")+     # for the NA values we choose the color grey
  new_scale_color()+      # allows to add an additional color scheme for another variable
     geom_tippoint(       # color the tip point by continent, you may change shape adding "shape = "
       aes(color=Continent),
       size=1.5)+ 
  scale_color_brewer(
    name = "Continent",   # name of your color scheme (will show up in the legend like this)
    palette="Set1",       # we choose a premade set of colors coming with the brewer package
    na.value="grey")+     # for the NA values we choose the color grey
  geom_tiplab(            # add the name of the sample to the tip of its branch (you can add as many text lines as you like with the + , you just need to change the offset value to place them next to each other)
    color='black',
    offset = 1,
    size = 1,
    geom = "text",
    align=TRUE)+ 
  ggtitle("Phylogenetic tree of Shigella sonnei")+ # title of your graph
  theme(
    axis.title.x=element_blank(), # removes x-axis title
    axis.title.y=element_blank(), # removes y-axis title
    legend.title=element_text(face="bold", size =12),  # defines font size and format of the legend title
    legend.text=element_text(face="bold", size =10),   # defines font size and format of the legend text
    plot.title = element_text(size =12, face="bold"),  # defines font size and format of the plot title
    legend.position="bottom", # defines placement of the legend
    legend.box="vertical",
    legend.margin=margin())   # defines placement of the legend

# Export your tree graph:
ggsave(here::here("images", "example_tree_circular_1.png"), width = 12, height = 14)

Manipulation of phylogenetic trees

Sometimes you may have a very large phylogenetic tree and be interested in only one part of it. For example, you may have produced a tree including historical or international samples to get an overview of where your dataset fits into the bigger picture, but then want to inspect only that portion of the larger tree more closely.

Since the phylogenetic tree file is just the output of sequencing data analysis, we cannot manipulate the order of the nodes and branches in the file itself; these have already been determined in a previous analysis of the raw NGS data. We are, however, able to zoom into parts, collapse parts, and even subset part of the tree.

Zooming in on one part of the tree:

If you don’t want to “cut” your tree, but only inspect part of it more closely you can zoom in to view a specific part:

# First we plot the whole tree:
p <- ggtree(tree) %<+% sample_data +
  geom_tiplab(size = 1.5) + # labels the tips of all branches with the sample name in the tree file
  geom_text2(aes(subset = !isTip, label = node), size = 5, color = "darkred", hjust = 1, vjust = 1) # labels all the nodes in the tree
p

We want to zoom into the branch which is sticking out, after node number 452 to get a closer look:

viewClade(p , node=452)

Collapsing one part of the tree:

The other way around we may want to ignore this branch which is sticking out and can do so by collapsing it at the node (indicated here by the blue square):

#First we collapse at node 452
p_collapsed <- collapse(p, node=452)

#To not forget that we collapsed this node we assign a symbol to it:
p_collapsed + geom_point2(aes(subset=(node == 452)), size=5, shape=23, fill="steelblue")

Subsetting a tree

If we want to make a more permanent change and create a new tree to work with we can subset part of it and even save it as new newick tree file.

# To do so you can add the node and tip labels to your tree to see which part you want to subset:
ggtree(tree, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =1) +       # labels the tips of all branches with the sample name in the tree file
  geom_text2(                  # labels all the nodes in the tree
    aes(subset = !isTip, label=node),
    size = 3,
    color = "darkred")+  
 theme(
   legend.position = "none", # removes the legend all together
   axis.title.x = element_blank(),
   axis.title.y=element_blank(),
   plot.title = element_text(size =12, face="bold"))

# A: Subset tree based on node:
sub_tree1 <- tree_subset(
  tree,
  node = 528) # we subset the tree at node 528

# lets have a look at the subset tree:
ggtree(sub_tree1)+
  geom_tiplab(size = 3) +
  ggtitle("Subset tree 1")

# B: Subset the same part of the tree based on a sample, in this case S17BD07692:
sub_tree2 <- tree_subset(
  tree,
  "S17BD07692",
  levels_back = 9) # levels back defines how many nodes backwards from the sample tip you want to go

# lets have a look at the subset tree:
ggtree(sub_tree2)+
  geom_tiplab(size = 3)+
  ggtitle("Subset tree 2")

You can also save your new tree as a Newick file:

ape::write.tree(sub_tree2, file='data/Shigelle_subtree_2.nwk')

Rotating nodes in a tree

As mentioned before, we cannot change the order of tips or nodes in the tree, as this is based on their genetic relatedness and is not subject to visual manipulation. But we can rotate branches around nodes if that eases our visualization.

First we plot our new subsetted tree with node labels to choose the node we want to manipulate:

p <- ggtree(sub_tree2) +
  geom_tiplab(size = 4) +
  geom_text2(                       # label all the nodes in the tree
    aes(subset=!isTip, label=node),
    size = 5,
    color = "darkred",
    hjust = 1,
    vjust = 1) 
p

We choose to manipulate the clade at node 39: by applying ggtree::rotate() or ggtree::flip() to the relevant nodes, node 39 moves to the bottom and nodes 37 and 38 move to the top:

# First highlight the two clades of interest so the change is visible:
p1 <- p +
  geom_hilight(node = 39, fill = "steelblue", extend = 0.0015) + # highlights node 39 in blue
  geom_hilight(node = 37, fill = "yellow", extend = 0.0015) +    # highlights node 37 in yellow
  ggtitle("Original tree")
p1

# We rotate the branches at nodes 39 and 37:
ggtree::rotate(p1, 39) %>% ggtree::rotate(37) +
  ggtitle("Rotated tree")

# Or we can use the flip command to achieve the same thing:
ggtree::flip(p1, 39, 37) +
  ggtitle("Flipped tree")

Example subtree with sample data annotation:

Let's say we are investigating the cluster of cases with clonal expansion which occurred in 2017 and 2018 at node 39 in our sub-tree. We add the year of strain isolation as well as travel history, and color by country, to see the origin of other closely related strains:

# Add sample data:
ggtree(sub_tree2) %<+% sample_data + 
  geom_tiplab(                      # labels the tips of all branches with the sample name in the tree file
    size =2.5,
    offset = 0.001, 
    align = TRUE) + 
  theme_tree2()+
  xlab("genetic distance")+ # add a label to the x-axis
  xlim(0, 0.015)+           # set the x-axis limits of our tree
  geom_tippoint(            # color the tip point by continent
    aes(color=Country),
    size=1.5)+ 
  scale_color_brewer(
    name = "Country", 
    palette="Set1", 
    na.value="grey")+
  geom_tiplab(              # add isolation year
    aes(label = Year),
    color='blue',
    offset = 0.0045,
    size = 3,
    linetype = "blank",
    geom = "text", 
    align=TRUE)+
  geom_tiplab(              # add travel history
    aes(label = Travel_history),
    color='red',
    offset = 0.006,
    size = 3,
    linetype = "blank",
    geom = "text",
    align=TRUE)+ 
  ggtitle("Phylogenetic tree of Belgian S. sonnei strains with travel history")+ # add plot title
  theme(
    axis.title.x=element_blank(),
    axis.title.y=element_blank(),
    legend.title=element_text(face="bold", size =12),
    legend.text=element_text(face="bold", size =10),
    plot.title = element_text(size =12, face="bold"))

Our observation points towards an import of strains from Asia, which then circulated in Belgium over the years and seem to have caused our latest outbreak.

More complex trees: adding heatmaps of sample data

We can add more complex information, such as the categorical presence of antimicrobial resistance genes and numeric values for measured resistance to antimicrobials, in the form of a heatmap using the ggtree::gheatmap() function.

First we need to plot our tree (this can be either linear or circular). We will use sub_tree2 from the subsetting section above.

# A: Circular tree:
p <- ggtree(sub_tree2, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =3) + 
  theme(
    legend.position = "none",
    axis.title.x=element_blank(),
    axis.title.y=element_blank(),
    plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))

p

Second we prepare our data. To visualize different variables with new color schemes, we subset our dataframe to the desired variable.

For example, we want to look at gender, and at mutations that could confer resistance to ciprofloxacin:

# Create your gender dataframe:
gender <- data.frame("gender" = sample_data[,c("Gender")])

# It's important to add the Sample_ID as rownames, otherwise the data cannot be matched to the tree tip.labels:
rownames(gender) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on mutations in the gyrA gene:
cipR <- data.frame("cipR" = sample_data[,c("gyrA_mutations")])
rownames(cipR) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on the measured minimum inhibitory concentration (MIC) from the laboratory:
MIC_Cip <- data.frame("mic_cip" = sample_data[,c("MIC_CIP")])
rownames(MIC_Cip) <- sample_data$Sample_ID

We create a first plot adding a binary heatmap for gender to the phylogenetic tree:

# First we add gender:
h1 <-  gheatmap(
  p,
  gender,
  offset = 12,        # offset shifts the heatmap to the right
  width=0.10,         # width defines the width of the heatmap column
  color=NULL,         # color defines the border of the heatmap columns
  colnames = FALSE)+  # hides column names for the heatmap
  scale_fill_manual(  # define the coloring scheme and legend for gender
    name = "Gender", 
    values = c("#00d1b1", "purple"),
    breaks = c("Male", "Female"),
    labels = c("Male", "Female"))+
  theme(
    legend.position="bottom",
    legend.title = element_text(size=12),
    legend.text = element_text(size =10),
    legend.box="vertical", legend.margin=margin())
h1

Then we add information on ciprofloxacin resistance genes:

# First we assign a new color scheme to our existing plot; this enables us to define and change the colors for our second variable
h2 <- h1 + new_scale_fill() 

# then we combine these into a new plot:
h3 <- gheatmap(
  h2,
  cipR,
  offset = 14,
  width=0.10, # adds the second row of heatmap describing ciprofloxacin resistance genes
  colnames = FALSE)+
  scale_fill_manual(
    name = "Ciprofloxacin resistance \n conferring mutation",
    values = c("#fe9698","#ea0c92"),
    breaks = c( "gyrA D87Y", "gyrA S83L"),
    labels = c( "gyrA d87y", "gyrA s83l"))+
  theme(
    legend.position="bottom",
    legend.title = element_text(size=12),
    legend.text = element_text(size =10),
    legend.box = "vertical",
    legend.margin = margin())+
  guides(
    fill = guide_legend(nrow=2,byrow=TRUE))

h3

Next we add continuous data on actual resistance determined by the laboratory as the minimum inhibitory concentration (MIC) of ciprofloxacin:

# First we add the new coloring scheme:
h4 <- h3 + new_scale_fill()

# then we combine the two into a new plot:
h5 <- gheatmap(
  h4,
  MIC_Cip,
  offset = 16,
  width=0.10,
  colnames = FALSE)+
  scale_fill_continuous(
    name = "MIC for ciprofloxacin",
    low = "yellow",
    high = "red",
    breaks = c(0, 0.50, 1.00),
    na.value = "white")+
  guides(
    fill = guide_colourbar(barwidth = 5, barheight = 1))+
  theme(
    legend.position="bottom",
    legend.title = element_text(size=12),
    legend.text = element_text(size =10),
    legend.box="vertical",
    legend.margin=margin())
h5

We can do the same exercise for a linear tree:

# B: Linear tree:
p <- ggtree(sub_tree2) %<+% sample_data +
  geom_tiplab(size =3) + # labels the tips
  theme_tree2()+
  xlab("genetic distance")+
  xlim(0, 0.015)+
 theme(
   legend.position = "none",
   axis.title.y=element_blank(),
   plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))


# First we add gender:

h1 <- gheatmap(
  p, gender,
  offset = 0.003,
  width=0.1,
  color="black", 
  colnames = FALSE)+
  scale_fill_manual(
    name = "Gender",
    values = c("#00d1b1", "purple"),
    breaks = c("Male", "Female"),
    labels = c("Male", "Female"))+
   theme(
     legend.position="bottom",
     legend.title = element_text(size=12),
     legend.text = element_text(size =10),
     legend.box="vertical", legend.margin=margin())
# h1

# Then we add ciprofloxacin after adding another colorscheme layer:

h2 <- h1 + new_scale_fill()
h3 <- gheatmap(
  h2, cipR,
  offset = 0.004,
  width=0.1,
  color="black",
  colnames = FALSE)+
  scale_fill_manual(
    name = "Ciprofloxacin resistance \n conferring mutation",
    values = c("#fe9698","#ea0c92"),
    breaks = c( "gyrA D87Y", "gyrA S83L"),
    labels = c( "gyrA d87y", "gyrA s83l"))+
  theme(
    legend.position="bottom",
    legend.title = element_text(size=12),
    legend.text = element_text(size =10),
    legend.box="vertical",
    legend.margin=margin())+
  guides(
    fill=guide_legend(nrow=2,byrow=TRUE))
# h3

# Then we add the minimum inhibitory concentration determined by the lab (MIC):
h4 <- h3 + new_scale_fill()
h5 <- gheatmap(
  h4, MIC_Cip,
  offset = 0.005,
  width=0.1,
  color="black",
  colnames = FALSE)+
  scale_fill_continuous(
    name = "MIC for ciprofloxacin",
    low = "yellow",
    high = "red",
    breaks = c(0,0.50,1.00),
    na.value = "white")+
   guides(
     fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(
     legend.position="bottom",
     legend.title = element_text(size=10),
     legend.text = element_text(size =8),
     legend.box="horizontal", legend.margin=margin())+
  guides(
    shape = guide_legend(override.aes = list(size = 2)))
h5

Interactive plots

Data visualisation increasingly needs to be interrogable by the audience, so creating interactive plots is becoming common. There are several ways to create these, but the two most important are {plotly} and {shiny}.

{Shiny} is covered in another part of this handbook, so we will only cover {plotly} here. #TODO - link to shiny page

Overview

Making plots interactive can sound more difficult than it turns out to be, thanks to some fantastic tools.

In this section, you’ll learn to easily make a plot interactive with the wonders of {ggplot2} and {plotly}.

Preparation

In the example you saw a very basic epicurve that had been transformed to be interactive using the fantastic {ggplot2} - {plotly} integration. So to start, make a basic chart of your own:

Loading data

linelist <- rio::import("linelist_cleaned.xlsx")

Manipulate and add columns (best taught in the epicurves section)

linelist <- linelist %>% 
  dplyr::mutate(
    ## If the outcome column is NA, change to "Unknown"
    outcome = dplyr::if_else(condition = is.na(outcome),
                             true = "Unknown",
                             false = outcome),
    ## If the date of infection is NA, use the date of onset instead
    date_earliest = dplyr::if_else(condition = is.na(date_infection),
                                   true = date_onset,
                                   false = date_infection),
    ## Summarise earliest date to earliest week 
    week_earliest = lubridate::floor_date(x = date_earliest,
                                          unit = "week",
                                          week_start = 1)
    )

Count for plotting

## Find number of cases in each week by their outcome
linelist <- linelist %>% 
  dplyr::count(week_earliest, outcome)

Plot

Make into a plot

p <- linelist %>% 
  ggplot()+
  geom_col(aes(week_earliest, n, fill = outcome))+
  xlab("Week of infection/onset") + ylab("Cases per week")+
  theme_minimal()

Make interactive

p <- p %>% 
  plotly::ggplotly()

Voila!

p

Modifications

When exporting to an Rmarkdown-generated HTML (like this book!) you want to make the plot as small as possible (with no negative side effects in most cases). For this, just add this line:

p <- p %>% 
  plotly::partial_bundle()

Some of the buttons on a standard plotly (as shown on the preparation tab) are superfluous and can be distracting, so it’s best to remove them. You can do this simply by piping the output into plotly::config

## these buttons are superfluous/distracting
plotly_buttons_remove <- list('zoom2d','pan2d','lasso2d', 'select2d','zoomIn2d',
                              'zoomOut2d','autoScale2d','hoverClosestCartesian',
                              'toggleSpikelines','hoverCompareCartesian')

p <- p %>% 
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Example

Earlier you saw how to make heatmaps (#TODO link to heatmaps), and they are just as easy to make interactive.

metrics_plot %>% 
  ggplotly() %>% 
  partial_bundle() %>% 
  config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Maps - preparation

You can even make interactive maps! However, they’re slightly trickier. Although {plotly} works well with ggplot2::geom_sf in RStudio, when you try to include its outputs in Rmarkdown HTML files (like this book), it doesn’t work well.

So instead you can use {plotly}’s own mapping tools which can be tricky but are easy when you know how. Read on…

We’re going to use Covid-19 incidence across African countries for this example. The data used can be found on the World Health Organisation website.

You’ll also need a new type of file, a GeoJSON, which is sort of similar to a shp file for those familiar with GIS. For this book, we used one from here.

GeoJSON files are stored in R as complex lists and you’ll need to manipulate them a little.

## You need two new packages: {rjson} and {purrr}
pacman::p_load(plotly, rjson, purrr)

## This is a simplified version of the WHO data
df <- rio::import(here::here("data", "covid_incidence.csv"))

## Load your geojson file
geoJSON <- rjson::fromJSON(file=here::here("data", "africa_countries.geo.json"))

## Here are some of the properties for each element of the object
head(geoJSON$features[[1]]$properties)
## $scalerank
## [1] 1
## 
## $featurecla
## [1] "Admin-0 country"
## 
## $labelrank
## [1] 6
## 
## $sovereignt
## [1] "Burundi"
## 
## $sov_a3
## [1] "BDI"
## 
## $adm0_dif
## [1] 0

This is the tricky part. For {plotly} to match your incidence data to GeoJSON, the countries in the geoJSON need an id in a specific place in the list of lists. For this we need to build a basic function:

## The property column we need to choose here is "sovereignt" as it is the names for each country
give_id <- function(x){
  
  x$id <- x$properties$sovereignt  ## Take sovereignt from properties and set it as the id
  
  return(x)
}

## Use {purrr} to apply this function to every element of the features list of the geoJSON object
geoJSON$features <- purrr::map(.x = geoJSON$features, give_id)

Maps - plot

plotly::plot_ly() %>% 
  plotly::add_trace(                    # The main plot mapping function
    type="choropleth",
    geojson=geoJSON,
    locations=df$Name,          #The column with the names (must match id)
    z=df$Cumulative_incidence,  #The column with the incidence values
    zmin=0,
    zmax=57008,
    colorscale="Viridis",
    marker=list(line=list(width=0))
  ) %>%
  plotly::colorbar(title = "Cases per million") %>%
  plotly::layout(title = "Covid-19 cumulative incidence",
                 geo = list(scope = 'africa')) %>% 
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Resources

Plotly is not just for R, but also works well with Python (and really any data science language as it’s built in JavaScript). You can read more about it on the plotly website

VI Advanced

Directory interactions

In this page we cover common scenarios in which you interact with directories (folders): running the files within them, importing from them, and saving to them.

Preparation

fs package

The fs package is a tidyverse package that facilitates directory interactions, improving on some of the base R functions. In the sections below we will often use functions from fs.

pacman::p_load(fs)

Accessing files in the directory

Running other files

source()

To run one R script from another R script, you can use the source() command (from base R).

source(here("scripts", "cleaning_scripts", "clean_testing_data.R"))

This is equivalent to viewing the above R script and clicking the “Source” button in the upper-right of the script. This will execute the script but will do it silently (no output to the R console) unless specifically intended. See the page on [Interactive console] for examples of using source() to interact with a user via the R console in question-and-answer mode.

render()

render() is a variation on source(), most often used for R markdown scripts. You provide the input = which is the R markdown file, and also the output_format = (typically “html_document”, “pdf_document”, “word_document”, etc.)

See the page on R markdown for more details. Also see the documentation for render() here or by entering ?render.
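As a brief illustration, a call might look like the sketch below (the file path "reports/daily_report.Rmd" is a hypothetical example):

```r
# Render an R markdown file to a Word document (hypothetical file path)
rmarkdown::render(
  input         = "reports/daily_report.Rmd",  # the R markdown script to render
  output_format = "word_document"              # or "html_document", "pdf_document", ...
)
```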

Run files in a directory

You can create a for loop and use it to source() every file in a directory, as identified with dir().

for(script in dir(here("scripts"), pattern = ".R$")) {   # for each script name in the R Project's "scripts" folder (with .R extension)
  source(here("scripts", script))                        # source the file with the matching name that exists in the scripts folder
}

If you only want to run certain scripts, you can identify them by name like this:

scripts_to_run <- c(
     "epicurves.R",
     "demographic_tables.R",
     "survival_curves.R"
)

for(script in scripts_to_run) {
  source(here("scripts", script))
}

Here is a comparison of the fs and base R functions.
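As a rough sketch of that comparison (assuming a "data" folder exists in your project; see the {fs} documentation for the full set of functions):

```r
# {fs} functions and their approximate base R equivalents:
fs::dir_ls("data")                    # ~ list.files("data") / dir("data")
fs::dir_create("outputs")             # ~ dir.create("outputs")
fs::file_exists("data/linelist.csv")  # ~ file.exists("data/linelist.csv")
fs::path("data", "linelist.csv")      # ~ file.path("data", "linelist.csv")
```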

Import files in a directory

See the page on Import and export for importing and exporting individual files.
See the page on Iteration and loops for an example with the package purrr demonstrating:

  • Splitting a dataframe and saving it as multiple CSV files
  • Splitting a dataframe and saving each part as a separate sheet within one Excel workbook
  • Importing multiple CSV files and combining them into one dataframe
  • Importing an Excel workbook with multiple sheets and combining them into one dataframe

base R

See below the functions list.files() and dir(), which perform the same operation of listing files within a specified directory. You can specify ignore.case = or a specific pattern to look for.

list.files(path = here("data"))

list.files(path = here("data"), pattern = ".csv")
# dir(path = here("data"), pattern = ".csv")

list.files(path = here("data"), pattern = "evd", ignore.case = TRUE)

If a file is currently “open”, it will display with a tilde in front, like “~$hospital_linelists.xlsx”.

R Markdown

R Markdown is a fantastic tool for creating automated, reproducible, and share-worthy outputs. It can generate static or interactive outputs, in the form of html, word, pdf, powerpoint, and others.

Overview

Using markdown will allow you to easily recreate an entire formatted document, including tables/figures/text, using new data (e.g. daily surveillance reports) and/or subsets of data (e.g. reports for specific geographies).

This guide will go through the basics. See ‘resources’ tab for further info.

Preparation

Background to Markdown

To explain some of the concepts and packages involved:

  • Markdown is a lightweight markup language, with syntax that allows for plain text formatting so that it can be converted to html and other formats. It is not specific to R, and usually a markdown file has an ‘.md’ extension.
  • R Markdown - the language: This is an extension of markdown that is specific to R, with file extensions ‘.Rmd’. This allows R code to be embedded in ‘chunks’ so that the code itself can be run, rather than just having a text document.
  • Rmarkdown - the package: This is used by R to render the .Rmd file into the desired output. However its focus is the markdown (text) syntax, so we also need…
  • Knitr: This package will read the code chunks, execute them, and ‘knit’ them back into the document. This is how tables and graphs are included alongside the text.
  • Pandoc: Finally, pandoc is needed to actually convert documents into e.g. word/pdf/powerpoint etc. It is separate from R.

The R Studio website describes how these all link in together (https://rmarkdown.rstudio.com/authoring_quick_tour.html):

Creating documents with R Markdown starts with an .Rmd file that contains a combination of markdown (content with simple text formatting) and R code chunks. The .Rmd file is fed to knitr, which executes all of the R code chunks and creates a new markdown (.md) document which includes the R code and its output.

The markdown file generated by knitr is then processed by pandoc which is responsible for creating a finished web page, PDF, MS Word document, slide show, handout, book, dashboard, package vignette or other format.

This may sound complicated, but R Markdown makes it extremely simple by encapsulating all of the above processing into a single render function. Better still, RStudio includes a “Knit” button that enables you to render an .Rmd and preview it using a single click or keyboard shortcut.

Installation

To create R Markdown, you need to have the following installed:

  • The Rmarkdown package, as described above: install.packages('rmarkdown')
  • Pandoc, which should come with RStudio. If you are not using RStudio, you can download it here: http://pandoc.org.
  • If you want to generate PDF output (a bit trickier), you will need to install LaTeX. For R Markdown users who have not installed LaTeX before, we recommend that you install TinyTeX (https://yihui.name/tinytex/):
install.packages('tinytex')
tinytex::install_tinytex()  # install TinyTeX

Workflow

Preparation of an R Markdown workflow involves ensuring you have set up an R project and have a folder structure that suits the desired workflow.

For instance, you may want an ‘output’ folder for your rendered documents, an ‘input’ folder for new cleaned data files, as well as subfolders within them which are date-stamped or reflect the subgeographies of interest. The markdown itself can go in a ‘rmd’ subfolder, particularly if you have multiple Rmd files within the same project.

You can set code up to create output subfolders for you each time you run reports (see “Producing an output”), but you should have the overall design in mind.

Because R Markdown can run into pandoc issues when running on a shared network drive, it is recommended that your folder is on your local machine, e.g. in a project within ‘My Documents’. If you use Git (much recommended!), this will be familiar.
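One possible layout is sketched below (the folder names are illustrative, not prescribed):

```r
# A sketch of one possible R project layout:
# my_project/
# ├── my_project.Rproj
# ├── rmd/        # the .Rmd files
# ├── inputs/     # cleaned data files
# └── outputs/    # rendered reports, in date-stamped subfolders

# The folders can be created from R, e.g. with {fs}:
fs::dir_create(c("rmd", "inputs", "outputs"))
```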

The R Markdown file

An R Markdown document looks like and can be edited just like a standard R script, in R Studio. However, it contains more than just the usual R code and hashed comments. There are three basic components:

1. Metadata: This is referred to as the ‘YAML metadata’ and sits at the top of the R Markdown document between two ‘- - -’ lines. It will tell your Rmd file what type of output to produce, formatting preferences, and other metadata such as document title, author, and date. There are other uses not mentioned here (but referred to in ‘Producing an output’). Note that indentation matters.

2. Text: This is the narrative of your document, including the titles. It is written in the markdown language, used across many different programmes. This means you can add basic formatting, for instance:

  • _text_ or *text* to italicise
  • **text** for bold text
  • # at the start of a new line for a title (and ## for a second-level title, ### for a third-level title, etc.)
  • * at the start of a new line for bullet points
  • `text` (within backticks) to display text as code

The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).

You can also include minimal R code within backticks, for within-text values. See example below.

3. Code chunks: This is where the R code goes, for the actual data management and visualisation. To note: These ‘chunks’ will appear to have a slightly different background colour from the narrative part of the document.

Each chunk always starts with three backticks and chunk information within squiggly brackets, and ends with three more backticks.

Some notes about the content of the squiggly brackets:

  • They start with ‘r’ to indicate that the language name within the chunk is r
  • Followed by the chunk name - note this should ALWAYS be a unique name or else R will complain when you try to render.
  • It can include other options too, but many of these can be configured with point-and-click using the setting buttons at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. This will come out as written preferences within the squiggly brackets, e.g. ‘echo=FALSE’ if you specify you want to ‘Show output only’.

There are also two arrows at the top right of each chunk, which are useful to run code within a chunk, or all code in prior chunks.
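Putting the three components together, a minimal .Rmd file might look like the sketch below (the title and the `linelist` object are hypothetical examples - any data used must be loaded within the document):

````markdown
---
title: "Daily surveillance report"
output: html_document
---

## Summary

Text written in markdown, with **bold**, *italics*, and within-text values
from R code in backticks: there are `r nrow(linelist)` cases in the linelist.

```{r epicurve, echo=FALSE}
# a chunk named "epicurve"; echo=FALSE shows the output only, not the code
hist(linelist$age)
```
````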

Producing an output

General notes

Everything used by this markdown must be referenced within the Rmd file. For instance, you need to load any required packages or data.

A single or test run from within R Markdown

To render a single document, for instance if you are testing it or if you only need to produce one rendered document at a time, you can do it from within the open R Markdown file. Click the “Knit” button at the top of the document.

The ‘R Markdown’ tab will start processing to show you the overall progress, and the rendered document will automatically open when complete. This document will also be saved in the same folder as your markdown, with the same file name aside from the file extension. This is obviously not ideal for version control, as you would then need to rename the file yourself.

A single run from a separate script

To run the markdown so that a date-stamped file is produced, you can create a separate script and call the Rmd file from within it. You can also specify the folder and file name, and include a dynamic date and time, so that file will be date stamped on production.

rmarkdown::render("rmd_reports/create_RED_report.Rmd",  
                        output_file = paste0("outputs/Report_", Sys.Date(), ".docx")) # Use 'paste0' to combine text and code for a dynamic file name

Routine runs into newly created date-stamped sub folders

Add a couple of lines of code to define the date you are running the report (e.g. using Sys.Date() as in the example above) and to create new sub-folders. If you want the date to reflect a specific date rather than the current date, you can also enter it as an object.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

# Run the render
rmarkdown::render("rmd_reports/create_report.Rmd",  
                        output_file = paste0(outputfolder, "/Report_", refdate, ".docx")) # Dynamic folder name now included

You may want some dynamic information to be included in the markdown itself. This is addressed in the next section.

Parametrised reports

Parameterised reports are the next step so that the content of the R Markdown itself can also be dynamic. For example, the title can change according to the subgeography you are running, and the data can filter to that subgeography of interest.

Let’s say you want to run the markdown to produce a report with surveillance data for Area1 and Area2. You will:

  1. Edit your R Markdown:
    • Change your YAML metadata to include a ‘params’ section, which specifies the dynamic object.
    • Refer to this parameterised object within the code as needed, e.g. filter(area == params$areanumber) rather than filter(area == "Area1").

For instance (simplified version which does not include setup code such as library/data loading):
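A sketch of what the YAML metadata might look like with a ‘params’ section (the values shown are illustrative defaults, which the render call can overwrite):

```yaml
---
title: "Surveillance report"
output: word_document
params:
  areanumber: "Area1"
  refdate: "2020-12-21"
---
```

Within the code chunks these values are then available as params$areanumber and params$refdate.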

You can change the content by editing the YAML as needed, or set up a loop in a separate script to iterate through the areas. As with the previous section, you can set up the folders as well.

As you can see below, you set up a list which includes all areas of interest (arealist), and when rendering the markdown you specify that the parameterized areanumber for a specific iteration is the Nth value of the arealist. For instance, for the first iteration, areanumber will equate to “Area1”. The code below also specifies that the Nth area name will be included in the output file name.

Note that this will work even if an area or date are specified within the YAML itself - that YAML information will get overwritten by the loop.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Set the list (note that this can also be an imported list)
arealist <- c("Area1", "Area2", "Area3", "Area4", "Area5")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

# Run the loop

for(i in 1:length(arealist))  { # This will loop through from the first value to the last value in 'arealist'

rmarkdown::render(here("rmd_reports/create_report.Rmd"), 
                        params = list(areanumber = arealist[i], # Assigns the ith value of arealist to the current areanumber
                                      refdate = refdate),
                        output_file = paste0(outputfolder, "/Report_", arealist[i], "_", refdate, ".docx")) 
                        
}

Routine reports

UNDER CONSTRUCTION

This page will cover the reportfactory package and other tips for routinizing your data flows and reports.

Resources

Errors & warnings

This page lists common errors and suggests solutions for troubleshooting them.

Data management errors

No such file or directory:

If you see an error like this when you try to export or import: check the spelling of the file name and filepath, and if the path contains slashes make sure they are forward (/) and not backward (\). Also make sure you used the correct file extension (e.g. .csv, .xlsx).
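As a sketch of the difference (the file path is a hypothetical example):

```r
# Forward slashes work on all operating systems:
linelist <- rio::import("data/linelist.csv")

# Single backslashes (as copied from Windows Explorer) are invalid escape
# characters in R strings and will error before the file is even looked for:
# linelist <- rio::import("data\linelist.csv")

# Building the path with here::here() avoids the problem entirely:
# linelist <- rio::import(here::here("data", "linelist.csv"))
```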

#Tried to add a value ("Missing") to a factor (with replace_na operating on a factor)
Problem with `mutate()` input `age_cat`.
i invalid factor level, NA generated
i Input `age_cat` is `replace_na(age_cat, "Missing")`.invalid factor level, NA generated

You likely have a column of class Factor (which contains pre-defined levels) and tried to add a new value to it. Convert it to class Character before adding a new value.
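As a minimal sketch of the fix:

```r
age_cat <- factor(c("0-4", "5-14", NA))  # a factor containing a missing value

# Convert to character first, then the new value can be added:
age_chr <- tidyr::replace_na(as.character(age_cat), "Missing")
age_chr

# Alternatively, make NA an explicit factor level with forcats:
# forcats::fct_explicit_na(age_cat, na_level = "Missing")
```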

Package masked errors

Error in select(data, var) : unused argument (var)

You think you are using dplyr::select() but the select() function has been masked by MASS::select() - specify dplyr:: or re-order your package loading so that dplyr is after all the others.

Other common masking errors stem from: plyr::summarise() and stats::filter(). Consider using the conflicted package.
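One way to handle this explicitly (a sketch, requiring the {conflicted} package):

```r
pacman::p_load(dplyr, conflicted)

# Declare once, at the top of your script, which package "wins" for a function name:
conflicted::conflict_prefer("select", "dplyr")
conflicted::conflict_prefer("filter", "dplyr")

# Or always write the package prefix explicitly:
# dplyr::select(linelist, case_id, age)
```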

Plotting errors

# ran recode without re-stating the x variable in mutate(x = recode(x, OLD = NEW))
Error: Problem with `mutate()` input `hospital`.
x argument ".x" is missing, with no default
i Input `hospital` is `recode(...)`.

Error: Insufficient values in manual scale. 3 needed but only 2 provided.

In ggplot(), the values = c(“orange”, “purple”) given to scale_fill_manual() are insufficient for the number of factor levels - consider whether NA has become a factor level.

Error: unexpected symbol in:
"  geom_histogram(stat = "identity")+
  tidyquant::geom_ma(n=7, size = 2, color = "red" lty"

If you see “unexpected symbol”, check for missing commas.

Also consider whether you re-arranged dplyr verbs and didn’t replace a pipe in the middle, or didn’t remove a pipe from the end.

Can’t add x object… You have a + at the end of a ggplot command that you need to delete.

Advanced RStudio

THIS PAGE IS UNDER CONSTRUCTION

Resources

Relational databases

THIS PAGE IS UNDER CONSTRUCTION

Resources

Shiny and dashboards

THIS PAGE IS UNDER CONSTRUCTION

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Collaboration

Overview

  • Package management
  • Using Github and R

Using Github and R to contribute

Here is an online guide to using Github and R. Some of the below text is adapted from this guide.

Overview of GitHub

Github is a website that supports collaborative projects with version control. In a nutshell, the project’s files exist in the Github repository as a “master” version (called a “branch”). If you want to make a change to those files you must create a different branch (version) to build and test the changes in. Master remains unaffected by your changes until your branch is merged (after some verification steps) into the master branch. A “commit” is the saving of a smaller group of changes you make within your branch. A Pull Request is your request to merge your changes into the master branch.

The way RStudio and Github interact is as follows:

  • There is a REMOTE version of the Epi_R_handbook R project that lives in a repository on the Github website - the master and other branches all exist and are viewable in this Github repository. Pull requests, issue tracking, and de-conflicting of merges happen online here.
  • On your LOCAL computer, you clone a version of the entire Github repository (all the R project files, from all its branches/versions). Locally, you can make changes to the files of any branch and “commit” those changes (save them with an explanatory note). These changes are only stored locally on your computer until…
  • Your LOCAL repository/Rproject interacts with the REMOTE one by 1) pulling (updating local files from the remote ones of the same branch) and 2) pushing (sending local changes to the same branch of the remote repository)
  • The software Git on your computer underlies all of this and is used by RStudio. You can write Git commands in the RStudio Terminal, but it is usually easier to interact with Git through RStudio’s point-and-click buttons. As noted below, you may occasionally still need to write Git commands in the Terminal.
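For example, these harmless commands in the RStudio Terminal confirm that Git is installed and show your current configuration:

```shell
# Run in the RStudio Terminal tab (not the R console)
git --version       # confirm Git is installed and on the PATH
git config --list   # show your current Git configuration (user.name, etc.)
```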


First steps

  1. Register for a free account with Github
  2. Have R and RStudio installed/updated
  3. Install Git to your computer (remember: Git is software on your computer accessed by RStudio; Github is a website)
  4. Familiarize yourself with the Github workflow by reading about it
  5. Become a contributor to the Epi_R_handbook Github repository (email )
  6. Clone the Github repository to your computer
    • In RStudio start a new project File > New Project > Version Control > Git
    • In “Repository URL”, paste the URL https://github.com/nsbatra/Epi_R_handbook.git (link also available from repo main page, green “Code” button, HTTPS)
    • Accept the default project directory name Epi_R_handbook
    • Take charge of – or at least notice! – where the Project will be saved locally
    • Check “Open in new session” and click “Create project”
    • You should now be in a new local RStudio project that is a clone of the Epi_R_handbook repository on Github

In your RStudio you will now have a Git tab in the same tab as your R Environment:

Please note the circled buttons, as they will be referenced later (from left to right):

  • Button to begin “committing” your changes to your branch (will open a new window)
  • Arrows to PULL (update your local version of the branch with any changes made to your branch by others) and to PUSH (send any completed commits stored in your local version of the branch to the remote/Github version of your branch)
  • Button to create a NEW branch of whichever version is listed to the right. You almost always want to branch off of the master (after you PULL to update the master first).
  • The branch you are currently working in.
  • Below all this, changes you make to code or files will begin to appear

To work on your Handbook page:

Note: Last I heard, Github will soon change its terminology from “master” to “main”, as it is an unnecessary reference to slavery

  1. Create a branch
  • Be in the master branch, then click the branch button/icon.
  • Name your branch with a one-word descriptive name (can use underscores if needed). You will see that locally, you are still in the project Epi_R_handbook, but you are no longer working on the master branch. Once created, the new branch will also appear on the Github website as a branch.
  • Make your changes… to files, code, etc. Your changes are tracked.
  • Commit the changes. Every series of changes you make that are substantial (e.g. adding or updating a section, etc.), stop and commit those changes. Think of a commit as a “batch” of changes related to a common purpose.
    • Press “Commit” in the git tab, opens new window
    • Review the changes you made (green, red etc.)
    • Highlight all the changes for the commit and “stage” them by checking their boxes or highlighting all the rows and clicking “stage all”
    • Write a commit message that is short but descriptive (required)
    • Press “commit” on the right side
  • Make and commit more changes, as many times as you would like
  • PULL - click the PULL icon (downward arrow) which updates the branch version on your local computer with any changes that have been made to it and stored in the remote/Github version
    • PULL often. Don’t hesitate. Always pull before pushing.
  • PUSH your changes up to the remote/Github version of your branch.
    • You may be asked to enter your Github username and password.
    • The first time you are asked, you may need to enter two Git command lines into the Terminal (the tab next to the R Console):
      • git config --global user.email "email@example.com" (your Github email address), and
      • git config --global user.name "Your Github username"
  2. Request to merge your branch with master

Once you have finished your commits and pushed everything up to the remote Github repository, you may want to request that your branch be merged into the master branch.

  • Go to Epi_R_handbook Github repository
  • Use the branch drop-down to view your branch, not master
  • At the top you will see a green button saying “Compare & pull request” for your branch. If not, look for another button that says “pull request”.
  • Write a detailed comment and click “Create Pull Request”
  • On the right, request a review from members of the project’s core team. You need at least one review to be able to complete the merge.
  • Once completed, delete your branch as explained below
  3. Delete your branch on Github

Go to the repository on Github and click the button to view all the branches (next to the drop-down to select branches). Find your branch and click the trash icon next to it. Read more here

Be sure to also delete the branch locally on your computer:

  • From RStudio, make sure you are in the master branch
  • Switch to the Terminal tab (adjacent to the R console) and enter: git branch -d branch_name, where branch_name is the name of the branch to be deleted
  • Refresh your Git tab and this branch should be gone.
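The branch, commit, pull/push, and delete cycle above can be sketched with command-line Git. This is a minimal sketch in a throwaway local repository; the commented lines assume a remote named origin, as a cloned handbook project would have:

```shell
mkdir -p /tmp/handbook_demo && cd /tmp/handbook_demo
git init -q
git checkout -q -b my_page_edits      # 1. create and switch to a new branch
echo "draft text" > page.Rmd          # ...make your changes...
git add -A                            # stage the changes
git -c user.name="You" -c user.email="you@example.com" \
    commit -q -m "Update page"        # commit with a short descriptive message
git log --oneline                     # the commit now appears in the history
# git pull origin my_page_edits       # PULL before pushing (with a real remote)
# git push origin my_page_edits       # PUSH your commits up to Github
# git checkout master                 # after the merge, return to master...
# git branch -d my_page_edits         # ...and delete the local branch
```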

TEST IT You can test your ability to make changes, commits, pull requests, etc. by modifying this R script which is saved to the main Rproject folder: test_your_abilities.R

Asked to provide password too often??
Instructions for connecting to the repository via an SSH key (more complicated): See chapters 10 and 11 of this tutorial

Writing functions

PAGE IS UNDER CONSTRUCTION

Resources

R on network drives

Overview

Using R on network or “company” shared drives can be extremely frustrating. This page contains approaches, common errors, and suggestions on troubleshooting, including for the particularly delicate situations involving Rmarkdown.

Using R on Network Drives: Overarching principles

  1. You must have administrator access on your computer. Set up RStudio specifically to run as administrator.
  2. Use your network (“\\…”) package library as little as possible; save packages to a local (“C:”) library when possible.
  3. The rmarkdown package must not be in a network (“\\…”) library, as then it cannot talk to TinyTeX or Pandoc.

Preparation


Useful commands

# Find libraries
.libPaths()                   # Your library paths, listed in order that R installs/searches. 
                              # Note: all libraries will be listed, but to install to some (e.g. C:) you 
                              # may need to be running RStudio as an administrator (it won't appear in the 
                              # install packages library drop-down menu) 

# Switch order of libraries
# this can affect the priority of R finding a package. E.g. you may want your C: library to be listed first
myPaths <- .libPaths() # get the paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them

# Find Pandoc
Sys.getenv("RSTUDIO_PANDOC")  # Find where RStudio thinks your Pandoc installation is

# Find a package
# gives first location of package (note order of your libraries)
find.package("rmarkdown", lib.loc = NULL, quiet = FALSE, verbose = getOption("verbose")) 

Troubleshooting common errors

“Failed to compile…tex in rmarkdown”

check/install tinytex, to C: location

# check/install tinytex, to C: location
tinytex::install_tinytex()
tinytex:::is_tinytex() # should return TRUE (note three colons)

Internet routines cannot be loaded

For example, “Error in tools::startDynamicHelp() : internet routines cannot be loaded”

  • Try selecting 32-bit version from RStudio via Tools/Global Options.
    • note: if the 32-bit version does not appear in the menu, make sure you are not using RStudio v1.2.
  • Or try uninstalling R and re-installing with different bit (32 instead of 64)

C: library does not appear as an option when I try to install packages manually

  • Must run RStudio as an administrator, then it will appear.
  • To set up RStudio to always run as administrator (advantageous when opening an Rproject, where you don’t click the RStudio icon to open)… right-click the RStudio icon, open Properties > Compatibility, and check the box “Run this program as an administrator”.

Pandoc 1 error

If you are getting pandoc error 1 when knitting Rmarkdowns on network drives:

myPaths <- .libPaths() # get the library paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them

Pandoc Error 83 (can’t find file…rmarkdown…lua…)
This means that Pandoc was unable to find a Lua filter file from the rmarkdown package.

See https://stackoverflow.com/questions/58830927/rmarkdown-unable-to-locate-lua-filter-when-knitting-to-word

Possibilities:

  1. Rmarkdown package is not installed
  2. Rmarkdown package is not findable
  3. an admin rights issue.

R is not able to find the rmarkdown package files, so check which library the rmarkdown package lives in. If it is in an inaccessible library (e.g. one starting with “\\”), consider manually moving it to the C: or other named-drive library.
But be aware that the rmarkdown package has to be able to reach TinyTeX, so the rmarkdown package can’t live on a network drive.
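To check where rmarkdown lives, base R’s find.package() is enough; the install line below is a commented sketch with a hypothetical local library path:

```r
# A path beginning with "\\" means the package is in a network-drive library
rmd_path <- tryCatch(find.package("rmarkdown"),
                     error = function(e) NA_character_)
print(rmd_path)

# Hypothetical local library - check your actual paths with .libPaths()
# install.packages("rmarkdown", lib = "C:/R/library")  # run RStudio as administrator
```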

Pandoc Error 61 For example: “Error: pandoc document conversion failed with error 61”

“Could not fetch…”

  • Try running RStudio as administrator (right click icon, select run as admin, see above instructions)
  • Also see if the specific package that was unable to be reached can be moved to C: library.

LaTex error (see below)

“! Package pdftex.def Error: File `cict_qm2_2020-06-29_files/figure-latex/unnamed-chunk-5-1.png’ not found: using draft setting.”

“Error: LaTeX failed to compile file_name.tex.”
See https://yihui.org/tinytex/r/#debugging for debugging tips. See file_name.log for more info.

Pandoc Error 127 This could be a RAM (memory) issue. Restart your R session and try again.

Mapping network drives

How does one open a file “through a mapped network drive”?

  • First, you’ll need to know the network location you’re trying to access.
  • Next, in the Windows file manager, you will need to right click on “This PC” on the right hand pane, and select “Map a network drive”.
  • Go through the dialogue to define the network location from earlier as a lettered drive.
  • Now you have two ways to get to the file you’re opening. Using the drive-letter path should work.

From: https://stackoverflow.com/questions/48161177/r-markdown-openbinaryfile-does-not-exist-no-such-file-or-directory/55616529?noredirect=1#comment97966859_55616529

ISSUES WITH HAVING A SHARED LIBRARY LOCATION ON NETWORK DRIVE

Error in install.packages()

Try removing… /../…/00LOCK (directory)

  • Manually delete the 00LOCK directory from your package library. Try installing again.
  • You can try the command pacman::p_unlock() (you can also put this command in your Rprofile so it runs every time the project opens).
  • Then try installing the package again. It may take several tries.
  • If all else fails, install the package to another library and then manually copy it over.
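The manual deletion can also be done from R. A sketch, assuming the lock sits in your first library path (some installs name the folder 00LOCK-packagename; adjust accordingly):

```r
lib <- .libPaths()[1]                    # the library R installs to first
lock_dir <- file.path(lib, "00LOCK")     # the leftover lock directory
if (dir.exists(lock_dir)) {
  unlink(lock_dir, recursive = TRUE, force = TRUE)
}
dir.exists(lock_dir)   # FALSE once the lock is gone
```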

Resources
